MultiChord: A Resilient Namespace Management Protocol 


Nancy Lynch Ion Stoica 
MIT UC Berkeley 


Abstract 


MultiChord is a new variant of the Chord namespace management algorithm [7] that includes lightweight
mechanisms for accommodating a limited rate of change, specifically, process joins and failures. This paper
describes the algorithm formally and evaluates its performance, using both simulation and analysis. Our main
result is that lookups are provably correct in the presence of a limited rate of change: each lookup returns
results that are consistent with a hypothetical ideal system that differs from the actual system only in entries
corresponding to recent joins and failures. In particular, if the number of joins and failures that occur during
a given time interval in a given region of the system is bounded, then all lookups are correct. A second result
is a guaranteed upper bound for the latency of a lookup operation in the absence of any other lookups in the
system. Finally, we establish a relationship between the deterministic assumptions of bounded joins and failures
and the probabilistic assumptions that are often used to model large-scale networks. In particular, we derive a
lower bound for the mean time between two violations of the deterministic assumptions in a steady-state system
where joins and failures are modeled by Poisson processes.


1 Introduction 


This paper describes MultiChord, a new, more resilient variant of the Chord namespace management algorithm [7]. 
The main innovation is that MultiChord includes lightweight mechanisms for accommodating a limited rate of change, 
specifically, process joins and failures. 

The contributions of this paper include (a) techniques for improving the performance and resiliency of peer-to-peer 
namespace management algorithms, and (b) methods of analyzing performance for such algorithms in the presence of 
a bounded rate of change. 


Building in resiliency: We improve the performance and resiliency of Chord by adding entries to processes'
routing (finger) tables, and by delaying a process from joining until its finger table is properly populated.
This demonstrates an approach to building peer-to-peer namespace management services in which resiliency to a
bounded rate of change is built in from the beginning. The method we use is to design the ideal communication
infrastructure with enough redundancy to accommodate a bounded rate of change without reducing latency, and to
maintain this redundant structure using gossiping. Newly joining processes should not participate fully in the
system until they have been fully incorporated into the communication infrastructure. This general approach
should extend to other communication infrastructures such as those proposed in [1, 4, 5, 6, 7].


Formal modeling and analysis: We present the algorithm precisely, using high-level, nondeterministic timed I/O 
automata pseudocode. We analyze its performance conditionally, assuming a limited rate of change. This demonstrates 
how peer-to-peer namespace management algorithms can be modeled using state machines and subjected to proofs 
and analysis. In particular, it demonstrates that interesting performance results can be obtained for such algorithms 
using conditional analysis, conditioned on the “normal case” assumption that changes happen at a bounded rate. This 
kind of analysis should be useful in comparing different namespace management algorithms. 

Our method of analysis is quite different from the probabilistic style used by Liben-Nowell et al. [2]. Our claims
are not probabilistic, but rather, worst-case bounds under restricted circumstances. Our assumptions about the rate of 
change are rather strong. However, as we discuss in Section 3, we can relax these assumptions by adding probabilistic 
assumptions, while still obtaining our stronger latency bounds. 


1.1 Overview 


The original Chord protocol [7] assumes a circular identifier space (called the Chord ring) of size N = 2^n.
Each process i is associated with a unique logical identifier in this space. Each process i maintains a routing
table (known as a finger table). The k-th entry in this table, called the k-th finger of process i, contains a
reference to the first process whose logical identifier follows process i's logical identifier by at least 2^k
in the clockwise direction on the Chord ring, where 0 ≤ k < n. In the remainder of this paper we refer to these
fingers as the power-of-two fingers of i. The successor of a logical identifier id is the process with logical
identifier id if such a process exists, and otherwise the first process whose logical identifier follows id in
the clockwise direction on the Chord ring. We redefine the notion of successor in the context of MultiChord in
Section 1.2.
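As a minimal sketch of the finger definition above (not code from the paper), the following Python computes the
power-of-two fingers of a process on a small ring. The function names `successor` and `power_of_two_fingers`, the
sorted-list representation of the ring, and the example identifiers are assumptions made for illustration.

```python
from bisect import bisect_left


def successor(ring, ident, n_bits):
    """Return the member with identifier `ident` if present, otherwise the
    first member clockwise from it. `ring` is a sorted list of identifiers."""
    ident %= 2 ** n_bits
    pos = bisect_left(ring, ident)
    # Wrap around to the smallest identifier when we run off the end.
    return ring[pos % len(ring)]


def power_of_two_fingers(ring, me, n_bits):
    """k-th finger = successor of (me + 2^k), for 0 <= k < n_bits."""
    return [successor(ring, me + 2 ** k, n_bits) for k in range(n_bits)]


example_ring = [1, 8, 14, 21, 32, 38, 42, 48, 51, 56]
fingers_of_8 = power_of_two_fingers(example_ring, 8, 6)
```

With n = 6 and this ring, the fingers of process 8 come out as 14, 14, 14, 21, 32, 42, since the targets 9, 10,
and 12 all fall before 14.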

In MultiChord, process i maintains, in addition to the finger table like that used in Chord, information about
its "b-block" (i.e., its own b successors and b predecessors) and all b-blocks of its power-of-two fingers. The
value of b is chosen based on an assumed upper bound on the "normal" rate of change. When the algorithm is in an
ideal state, each process's finger table contains its b-block, as well as a b-block for each of its power-of-two
fingers. However, this information can degrade from an ideal state as a result of process joins and failures.

MultiChord includes lightweight mechanisms, based on periodic background gossiping, for maintaining the system
in a nearly ideal state in the face of limited change, i.e., limited joins and failures. Each process i
continually sends its own b-block to its b successors and b predecessors, which allows them to update their
finger tables. In addition, process i continually "pings" its power-of-two fingers, which respond by returning
their own b-blocks. These periodic exchanges of information between a process and the processes in its finger
table allow the system to gravitate back toward an ideal state in the face of changes. Like Chord, MultiChord
does not differentiate between a process failure and a departure. When a process i fails or leaves, the processes
that maintain process i in their finger tables remove it when its entry expires.

When a new process i joins the system, it first populates its finger table with its b-block, and the b-blocks of
its power-of-two fingers. As in Chord, a process i uses the lookup operation to find its power-of-two fingers.
There are two other instances when a process i invokes a lookup: (i) when a client at location i explicitly
invokes a lookup operation for a specified target, and (ii) when it decides to refresh its finger table.

Like Chord, MultiChord implements the lookup operation in an iterative fashion. Consider a process i that
performs a lookup on value x. At every iteration (stage), process i sends a query to the best known predecessor
for x. Let process k be this predecessor. Upon receiving the query, process k checks whether it knows the process
responsible for x (that is, whether its immediate successor is responsible for x), and if so, it sends the answer
back to process i. Otherwise process k sends its best known predecessor for x to i. MultiChord generalizes this
procedure: at every stage process i sends c > 1 queries to the best known c predecessors for x. In turn, each
queried process k responds with its best known c predecessors of x. As we will show, this redundancy increases
the resilience of the lookup in the face of changes.
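The effect of sending c queries per stage can be illustrated with a small simulation. This is a simplified
sketch, not the MultiChord automaton: each process's view holds only its own b-block and its power-of-two
fingers (not the fingers' b-blocks), failed processes simply never answer, and there are no timeouts or
expirations. The names `ppreds`, `build_views`, and `lookup` are hypothetical.

```python
def ppreds(x, c, view, size):
    """Up to c identifiers in `view` that precede x, nearest first."""
    others = sorted((y for y in view if y != x), key=lambda y: (x - y) % size)
    return others[:c]


def build_views(ring, b, n_bits):
    """Give every member a view: itself, b neighbors on each side, and the
    successor of me + 2^k for each k (a simplification of the finger table)."""
    size = 2 ** n_bits
    members = sorted(ring)
    views = {}
    for idx, me in enumerate(members):
        view = {me}
        for off in range(1, b + 1):
            view.add(members[(idx + off) % len(members)])
            view.add(members[(idx - off) % len(members)])
        for k in range(n_bits):
            target = (me + 2 ** k) % size
            view.add(min(members, key=lambda y: (y - target) % size))
        views[me] = view
    return views


def lookup(start, x, c, views, alive, size, max_stages=16):
    """Iterative lookup: each stage queries the c best known predecessors of x.
    Returns (stage, c best known predecessors) or None if no stage completes."""
    known = set(views[start])
    for stage in range(1, max_stages + 1):
        done = False
        for node in ppreds(x, c, known, size):
            if node not in alive:
                continue  # a failed process never responds
            answer = ppreds(x, c, views[node], size)
            known.update(answer)
            if node in answer:  # responder believes it is among the c best
                done = True
        if done:
            return stage, ppreds(x, c, known, size)
    return None


ring = [0, 8, 16, 24, 32, 40, 48, 56]
views = build_views(ring, b=2, n_bits=6)
all_alive = set(ring)
without_32 = all_alive - {32}
```

In this example, a lookup for 40 started at process 0 finishes in one stage when every process is alive. If
process 32 has failed, the lookup with c = 2 still completes (through 16 and then 24), whereas with c = 1 the
only query goes to the failed process and the lookup makes no progress.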

The value of c is chosen to be larger than the number of changes that “normally” occur in a “small” interval of time, 
in a limited region of the ring. The length of this small interval of time is assumed to be sufficient for the system to 
recover from a limited number of changes in the relevant region of the ring. The admissible rate of change is quantified 
in Section 3. 


1.2 Notations 


Notation             Definition
PId                  the set of physical process identifiers (e.g., IP address and port number)
XId                  the set of logical identifiers, 0..2^n - 1
XtoP, PtoX           one-to-one correspondence from XId to PId, and its inverse
succ(x, k, R)        the k-th successor of x in ring R
pred(x, k, R)        the k-th predecessor of x in R
succset(x, k, R)     successor set of x; succset(x, k, R) = {succ(x, l, R) : 0 ≤ l ≤ k}
psuccset(x, k, R)    proper successor set of x; psuccset(x, k, R) = {succ(x, l, R) : 1 ≤ l ≤ k}
predset(x, k, R)     predecessor set of x; predset(x, k, R) = {pred(x, l, R) : 0 ≤ l ≤ k}
ppredset(x, k, R)    proper predecessor set of x; ppredset(x, k, R) = {pred(x, l, R) : 1 ≤ l ≤ k}
block(x, k, R)       block of x; block(x, k, R) = succset(x, k, R) ∪ predset(x, k, R)


Table 1: Notations used in this paper. 


Table 1 shows the main notations used in this paper. Each process is identified by a physical identifier (e.g.,
IP address and port number), and a logical identifier in identifier space 0..2^n - 1, where N = 2^n. A ring R is
a nonempty subset of logical identifiers (XId), ordered in a clockwise direction.

The k-th successor of x in R is denoted by succ(x, k, R). For k = 0, succ(x, 0, R) = x if x ∈ R, and is otherwise
undefined. If k ≥ 1 then succ(x, k, R) is the k-th value encountered when moving clockwise in R - {x} starting
from the position of x, if |R - {x}| ≥ k, and is otherwise undefined. The k-th predecessor of x is defined
similarly (see Table 1).
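The ring notation can be made concrete with a short Python sketch. It is an illustration under assumptions: a
ring is represented as a Python set of integers, `size` stands for N = 2^n, undefined values are returned as
`None`, and the helper names `_by_distance` and `_collect` are hypothetical.

```python
def _by_distance(x, ring, size, clockwise):
    """Members of ring other than x, ordered by distance from x."""
    sign = 1 if clockwise else -1
    return sorted((y for y in ring if y != x),
                  key=lambda y: (sign * (y - x)) % size)


def succ(x, k, ring, size):
    """k-th successor of x in ring; None where the definition is undefined."""
    if k == 0:
        return x if x in ring else None
    others = _by_distance(x, ring, size, clockwise=True)
    return others[k - 1] if len(others) >= k else None


def pred(x, k, ring, size):
    """k-th predecessor of x in ring; None where the definition is undefined."""
    if k == 0:
        return x if x in ring else None
    others = _by_distance(x, ring, size, clockwise=False)
    return others[k - 1] if len(others) >= k else None


def _collect(fn, x, first, k, ring, size):
    values = (fn(x, l, ring, size) for l in range(first, k + 1))
    return {v for v in values if v is not None}


def succset(x, k, ring, size):
    return _collect(succ, x, 0, k, ring, size)


def psuccset(x, k, ring, size):
    return _collect(succ, x, 1, k, ring, size)


def predset(x, k, ring, size):
    return _collect(pred, x, 0, k, ring, size)


def ppredset(x, k, ring, size):
    return _collect(pred, x, 1, k, ring, size)


def block(x, k, ring, size):
    return succset(x, k, ring, size) | predset(x, k, ring, size)


R = {1, 8, 14, 21, 32}
```

For example, with R = {1, 8, 14, 21, 32} in an identifier space of size 64, block(8, 1, R) is {1, 8, 14}, and
succ(9, 0, R) is undefined because 9 is not a member of R.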


2 The MultiChord Protocol 


In this section we present the details of the MultiChord protocol. 


2.1 Process Automaton: Signature 


For the rest of this section, we fix a physical address i ∈ PId, and describe the process automaton for
location i, MultiChord_i. Throughout this section, we use me as an abbreviation for the general identifier
g ∈ GId with g.phys = i and g.log = PtoX(i), where g.phys and g.log denote the physical identifier and the
logical identifier of g, respectively. Formally, MultiChord_i is a timed I/O automaton, as defined in Chapter 23
of [3].

The signature of MultiChord_i is given in Figure 1. The external signature describes the inputs and outputs
(primarily, client invocations and responses) by which the MultiChord service interacts with its environment. The
external signature includes join, lookup and receive inputs and corresponding acknowledgments. We do not include
special "leave" requests and responses in this paper; instead, we treat leaves as failures. We do not consider
rejoining after a failure. The internal signature consists of transitions that implement join and lookup
protocols, and maintain the finger tables in the face of a limited rate of change.


Input:
    join(J)_i, J a finite subset of PId - {i}
    lookup(x)_i, x ∈ XId
    receive(m)_{j,i}, m ∈ Msg, j ∈ PId

Output:
    join-ack_i
    lookup-ack(H)_i, H ⊆ GId
    send(m)_{i,j}, m ∈ Msg, j ∈ PId

Internal:
    join-ping_i
    neighbor-refresh_i
    chord-ping_i
    stabilize(x)_i, x ∈ XId
    garbage-collect(f)_i, f ∈ Finger

Time-passage:
    time-passage(t), t ∈ R+


Figure 1: MultiChord_i: Signature


2.2 Process Automaton: Data Types and Constants 


Table 2 shows the data structures and the message formats used by the MultiChord protocol. In addition we define two 
operations on sets of fingers: 
1. update(F, F'), which computes F ∪ F'; if a finger f belongs to both sets of fingers F and F' then f inherits
the highest expiration time, exptime, that it has in the two sets.
2. truncate(F, t), which bounds the exptime of each finger f ∈ F to at most t, i.e., f.exptime := min(f.exptime, t).
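A minimal Python sketch of these two operations follows. The `Finger` tuple mirrors the (gid, exptime) record
used by the protocol; representing a finger set as a Python set of such tuples, and a gid as a (phys, log) pair,
are assumptions made for illustration.

```python
import math
from typing import NamedTuple


class Finger(NamedTuple):
    gid: tuple       # (physical identifier, logical identifier)
    exptime: float   # expiration time; math.inf means "never expires"


def update(F, G):
    """Union of two finger sets; a finger present in both keeps the
    highest of its two expiration times."""
    best = {}
    for f in list(F) + list(G):
        if f.gid not in best or f.exptime > best[f.gid]:
            best[f.gid] = f.exptime
    return {Finger(gid, exptime) for gid, exptime in best.items()}


def truncate(F, t):
    """Cap every expiration time at t."""
    return {Finger(f.gid, min(f.exptime, t)) for f in F}


mine = {Finger(("p1", 5), 10.0), Finger(("p2", 9), math.inf)}
received = {Finger(("p1", 5), 12.0)}
merged = update(mine, received)
capped = truncate(merged, 20.0)
```

Capping matters because a process records itself with an infinite expiration time; after `truncate`, the copy it
sends to others expires like any other entry.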


Name            Description
Finger          finger data structure; consists of fields: (gid ∈ GId, exptime ∈ R≥0 ∪ {∞})
ReqId           request identifier set, partitioned into subsets ReqId(i), i ∈ PId; used to identify lookup
                instances
JoinRecord      used to keep track of progress in a process's attempt to join the system; consists of fields:
                (reqids ⊆ ReqId, comp ⊆ ReqId, acktime ∈ R≥0 ∪ {∞})
ClientRecord    used to keep track of client-initiated lookup requests at a particular location; consists of
                fields: (reqids ⊆ ReqId, comp ⊆ ReqId, acked ⊆ ReqId(i))
LookupRespMsg   lookup response message, (tag = lookup-resp, id ∈ ReqId, stage ∈ N+, preds ∈ Set[Finger])
PingMsg         ping message used to refresh finger information, (tag = ping)
BlockMsg        message used to send a block to another process, (tag = block, block ∈ Set[Finger])
T_g             the time between scheduling gossiping messages, i.e., PingMsg and BlockMsg messages
T_j             the time from when a joining process has received all its responses until it responds to its
                client
b               number of proper predecessors and successors that a process maintains about itself and its
                power-of-two fingers
c               number of responses that a client returns in response to a lookup request; c < b


Table 2: Data structures and message formats used in MultiChord. 


MultiChord uses only five types of messages: Lookup, LookupRespMsg and LookupRespCompletion to implement join
and lookup operations, and PingMsg and BlockMsg to maintain the finger tables in the face of changes.

In addition, MultiChord uses the following time constants: (1) T_g, the time between scheduled gossiping
messages, (2) T_e, the timeout value for expiration of entries in the finger table, and (3) T_j, the time from
when a joining process has completed its systematic collection of responses until it responds to its client.

Finally, MultiChord uses two constants b and c. Constant b represents the level of redundancy used by a process
to maintain routing information. In particular, each process maintains its b proper successors and b proper
predecessors, and b proper successors and b proper predecessors of each of its power-of-two fingers. Constant c
represents the level of redundancy used to perform lookups, the basic operations in MultiChord. During lookup
operations, each process issues c concurrent queries, which makes it highly likely that at least one process will
respond. The value of c is chosen to be larger than the number of changes that are likely to occur in an arc of
the ring, in intervals of some reasonable length. The length of this interval should be sufficiently long to
allow recovery from recent changes. The value of b is usually larger than c; c must be large enough to ensure a
response under "normal" conditions (with bounded changes), while b must be large enough to support the
infrastructure maintenance protocol.


2.3 Process Automaton: State 


The state of MultiChord_i consists of the state variables listed in Figure 2. Note that our initializations of these
variables assign tuples to record-valued variables. We use the convention that the order of the components in the 
tuples is the same as the order presented in the definitions of the record types. 


State variables:
    status ∈ {idle, joining, active}, initially idle
    join ∈ JoinRecord, initially (∅, ∅, ∞)
    client ∈ ClientRecord, initially (∅, ∅, ∅)
    used-reqids ⊆ ReqId(i), initially ∅
    requests ∈ Set[Request], initially ∅
    fingers ∈ Set[Finger], initially {(me, ∞)}
    out-queue, a sequence of Msg × PId, initially empty
    ping-time ∈ R≥0 ∪ {∞}, initially ∞
    nbr-refresh-time ∈ R≥0 ∪ {∞}, initially ∞
    failed, a Boolean, initially false

Derived variables:
    local-ring = {x ∈ XId : ∃f ∈ fingers [f.log = x]}
    For x ∈ XId, k ≥ 0:
        f-succset(x, k) = {f ∈ fingers : f.log ∈ succset(x, k, local-ring)}
        f-psuccset(x, k) = {f ∈ fingers : f.log ∈ psuccset(x, k, local-ring)}
        f-predset(x, k) = {f ∈ fingers : f.log ∈ predset(x, k, local-ring)}
        f-ppredset(x, k) = {f ∈ fingers : f.log ∈ ppredset(x, k, local-ring)}
        f-block(x, k) = {f ∈ fingers : f.log ∈ block(x, k, local-ring)}


Figure 2: MultiChord_i: State

The status variable keeps track of the state of process i. The join variable keeps track of the progress of the
joining protocol for process i, and the client variable keeps track of the progress of all client-initiated
lookups at location i. The fingers variable contains a set of fingers, which represent process i's best knowledge
of the current members of the ring (including their expiration times). The used-reqids variable keeps track of
which request identifiers in ReqId(i) have already been used; it is used to model the generation of unique
identifiers. The requests variable keeps track of the set of requests that have been initiated at location i;
these may be generated on behalf of the local joining protocol, local client lookup requests, or heavyweight
stabilization. The out-queue variable is a buffer for messages that process i has generated and has not yet sent.

The nbr-refresh-time and ping-time variables are used to schedule the gossip messages; nbr-refresh-time is used
by process i to schedule sending of its own block to its nearby neighbors, whereas ping-time is used by process i
to schedule "ping" messages to request block information from other processes. Finally, the failed variable is a
flag saying whether process i has failed.

Process i also maintains some derived variables, which also appear in Figure 2. The derived variable local-ring
is defined to be the set of logical identifiers that appear in i's fingers variable, that is, local-ring
represents i's current local view of the global ring. Other derived variables are defined to give various
successor and predecessor sets, with respect to the local-ring. For example, f-succset(x, k) is defined to be the
set of fingers in the current finger set whose logical identifiers are among the k successors of x in the current
local-ring; if x appears in local-ring then this set includes x itself.


2.4 Process Automaton: Transitions 


In this section we present the main transitions in MultiChord. Section 2.4.1 describes the basic transitions such as 
message sending, garbage-collection, and time-passage transitions. Section 2.4.2 shows the transitions involved in the 
joining protocol, and Section 2.4.3 presents the transitions involved in the stabilization protocol. Finally, Section 2.4.4 
describes transitions involved in the client lookup protocol. 


2.4.1 Basic Transitions 


Figure 3(a)-(c) shows three basic transitions: sending, garbage-collection, and time-passage transitions.

A send transition simply removes the first Msg from out-queue and sends it to the indicated destination, using an
assumed point-to-point network. Process i can do this only if it has at least begun the protocol, and has not
failed. A garbage-collect transition removes an entry from the fingers set when the entry's exptime has been
reached. A time-passage transition advances the time up to the next event, i.e., the scheduled time of pinging,
acknowledging the client, or neighbor-refreshing, or the exptime of any finger in the fingers set. Time may not
pass at all if the out-queue is nonempty; this implies that messages in the out-queue are sent out immediately,
without any time passage.
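The garbage-collection rule and the limit on time passage can be sketched in a few lines of Python. This is an
illustration only: the finger set is represented as a dictionary from identifier to expiration time, and the
function names are hypothetical.

```python
import math


def garbage_collect(fingers, now):
    """Drop every finger whose expiration time has been reached."""
    return {gid: exptime for gid, exptime in fingers.items() if exptime > now}


def max_time_advance(now, ping_time, ack_time, nbr_refresh_time, fingers, out_queue):
    """Largest t by which time may pass: zero while messages are waiting,
    otherwise the distance to the earliest scheduled event or expiration."""
    if out_queue:
        return 0.0
    deadline = min([ping_time, ack_time, nbr_refresh_time, *fingers.values()])
    return max(0.0, deadline - now)


fingers = {"me": math.inf, "a": 12.0, "b": 9.0}
live = garbage_collect(fingers, now=10.0)
idle_step = max_time_advance(10.0, 11.5, math.inf, math.inf, live, out_queue=[])
busy_step = max_time_advance(10.0, 11.5, math.inf, math.inf, live, out_queue=[("ping", "a")])
```

Here the entry for "b" is removed because its expiration time has passed, time may then advance by at most 1.5
(until the next ping), and it may not advance at all while a message is still queued.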


(a) Output send(m)_{i,j}
    Precondition:
        ¬failed
        status ≠ idle
        (m, j) = head(out-queue)
    Effect:
        remove head(out-queue)

(b) Internal garbage-collect(f)_i
    Precondition:
        ¬failed
        status = active
        f ∈ fingers
        f.exptime ≤ now
    Effect:
        fingers := fingers - {f}

(c) time-passage(t)
    Precondition:
        if ¬failed then
            now + t ≤ ping-time
            now + t ≤ join.acktime
            now + t ≤ nbr-refresh-time
            ∀f ∈ fingers : now + t ≤ f.exptime
            out-queue is empty
    Effect:
        now := now + t


Figure 3: (a) Sending transitions; (b) Garbage-collection transition; (c) Time passage transitions.


2.4.2 Transitions Involved in the Joining Protocol 


Like Chord, in MultiChord a process uses lookups to populate its finger table when it joins the system. Where the two 
protocols differ is in the amount of state required to join the system. Whereas in Chord a process is required to know 
only a set of successor processes, in MultiChord a process is required to know a set of processes (i.e., a b-block) for 
each of its power-of-two fingers. As we will show in Section 3 this redundancy increases the resilience of the protocol 
in the face of changes. 

Next, we present the details of the transitions involved when process i joins the system. These include:

1. The join_i transitions, by which the client at location i requests to join.

2. The receive transitions for lookup, lookup-resp, and lookup-comp messages, which are involved in initially
populating process i's finger set.

3. The join-ping transitions and the receive transitions for ping messages; these are used to complete the finger
set before process i responds to the client.

4. The join-ack_i transitions, by which process i responds to its client.

Figure 4 shows the join and join-ack transitions. In a join(J)_i transition, process i initiates joining by
submitting a set J of PIds of other processes that should already be members of the system. Process i handles the
join request only if it has not failed and has not previously begun joining. To handle the join request, the
process first sets its status to joining and schedules its ping task. If J = ∅, the process is already done and
schedules its response to the client. Otherwise, if J ≠ ∅, process i launches a set of lookup requests, one for
itself and one for each of its power-of-two successors.

When all these requests have completed, and when sufficient additional time has passed (as determined by a
scheduled ack-time being reached), process i can report back to the client with a join-ack_i transition. When it
does so, it converts its status to active and schedules its nbr-refresh task.
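The first step of the join can be sketched as follows: one lookup request for the joiner's own identifier and
one for each power-of-two target, each sent to every contact in J. This is a hypothetical illustration; the
function name `start_join`, the use of a counter for request identifiers, and the explicit wrap-around modulo
2^n are assumptions.

```python
from itertools import count


def start_join(me_log, contacts, n_bits, rid_source):
    """Return (requests, out_queue) produced when a process starts joining.
    requests holds (rid, stage, target) records; out_queue holds
    (message, destination) pairs, one per target per contact."""
    if not contacts:
        return [], []  # nothing to look up; the join can be acknowledged
    size = 2 ** n_bits
    targets = [me_log] + [(me_log + 2 ** k) % size for k in range(n_bits)]
    requests, out_queue = [], []
    for x in targets:
        rid = next(rid_source)  # a fresh, unused request identifier
        requests.append((rid, 1, x))
        for j in contacts:
            out_queue.append((("lookup", rid, 1, x), j))
    return requests, out_queue


requests, out_queue = start_join(5, ["p1", "p2"], n_bits=3, rid_source=count())
```

With n = 3 and logical identifier 5, the targets are 5, 6, 7, and 1, so two contacts yield eight lookup messages.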


Input join(J)_i
    Effect:
        if ¬failed then
            if status = idle then
                status := joining
                ping-time := now
                if J = ∅ then join.acktime := now
                else
                    for x ∈ {me.log} ∪ {me.log + 2^k : 0 ≤ k ≤ n-1} do
                        choose rid ∈ ReqId(i) - used-reqids
                        used-reqids := used-reqids ∪ {rid}
                        join.reqids := join.reqids ∪ {rid}
                        requests := requests ∪ {(rid, 1, x)}
                        for j ∈ J do
                            add ((lookup, rid, 1, x), j) to out-queue

Output join-ack_i
    Precondition:
        ¬failed
        status = joining
        join.reqids ⊆ join.comp
        join.acktime = now
    Effect:
        status := active
        nbr-refresh-time := now


Figure 4: Client-level transitions related to joining


As in Chord, MultiChord implements an iterative lookup protocol. The processing of a lookup request involves
three types of transitions, which appear in Figure 5. When process i receives a lookup message, it handles this
message only if it is already active, that is, if it has completed its joining protocol. In order to handle the
lookup message, it sends either a lookup-resp or a lookup-comp message, depending on whether it thinks that the
search has reached its goal. The test for completion is that, according to i's current information, process i is
among the c proper predecessors of the target x. In the case of a lookup-comp message, process i sends back its
best information about the target's block of radius b. In the case of a lookup-resp message, process i sends back
its c best proper predecessors for the target. In either case, process i first truncates all fingers' exptimes to
now plus the maximum timeout value T_e; this is because i's entry for itself has exptime = ∞, but we do not want
others to record exptime = ∞ for i.

When process i receives a lookup-resp message for the current stage of a current request, it updates its finger
table with the information contained in the preds field of the incoming message. Then because the request is not
completed, process i generates a new batch of lookup messages for the next stage of the same request. This next
stage has the next-higher stage number, which is recorded in the request record. The messages for the new stage
are sent to the c currently-known best proper predecessors of the target. Note that the number of messages does
not increase exponentially at each stage; the protocol limits the number of messages to c.


Input receive(lookup, r, s, x)_{j,i}
    Effect:
        if ¬failed then
            if status = active then
                if me.log ∈ ppredset(x, c, local-ring) then
                    block := truncate(f-block(me.log, b), now + T_e)
                    add ((lookup-comp, r, s, block), j) to out-queue
                else
                    preds := truncate(f-ppredset(x, c), now + T_e)
                    add ((lookup-resp, r, s, preds), j) to out-queue

Input receive(lookup-comp, r, F)_{j,i}
    Effect:
        if ¬failed then
            new-fingers := {f ∈ F : f.exptime > now}
            fingers := update(fingers, new-fingers)
            if r ∈ join.reqids then
                join.comp := join.comp ∪ {r}
                if join.reqids ⊆ join.comp and join.acktime = ∞ then
                    join.acktime := now + T_j
            if r ∈ client.reqids then
                client.comp := client.comp ∪ {r}

Input receive(lookup-resp, r, s, F)_{j,i}
    Effect:
        if ¬failed then
            new-fingers := {f ∈ F : f.exptime > now}
            fingers := update(fingers, new-fingers)
            if ∃x [(r, s, x) ∈ requests] then
                choose x where (r, s, x) ∈ requests
                requests := requests - {(r, s, x)} ∪ {(r, s+1, x)}
                for f ∈ f-ppredset(x, c) do
                    add ((lookup, r, s+1, x), f.gid.phys) to out-queue


Figure 5: Transitions of the lookup protocol

When process i receives a lookup-comp message for the current stage of a current request, it updates its finger
table with the information in the block field of the incoming message. As in the lookup-resp case, process i
increments the request's stage number, to register the fact that some response for this stage has arrived. If the
current request is part of i's joining protocol, then the completion of this request is recorded in the join
record; if this represents the completion of the last request, then process i also schedules the client
acknowledgment. On the other hand, if the request is being done on behalf of a client-initiated lookup, the
completion is recorded in the client record (see Appendix A).
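The join bookkeeping described above can be sketched in Python. This is a minimal illustration of the join
record only (the client record is handled analogously); the dataclass layout and the function name
`on_lookup_comp` are assumptions, and `t_j` stands for the constant T_j.

```python
import math
from dataclasses import dataclass, field


@dataclass
class JoinRecord:
    reqids: set = field(default_factory=set)   # requests issued for the join
    comp: set = field(default_factory=set)     # requests that have completed
    acktime: float = math.inf                  # when to acknowledge the client


def on_lookup_comp(join, rid, now, t_j):
    """Record completion of request rid; once every join request has
    completed, schedule the acknowledgment t_j time units from now."""
    if rid in join.reqids:
        join.comp.add(rid)
        if join.reqids <= join.comp and join.acktime == math.inf:
            join.acktime = now + t_j
    return join


record = JoinRecord(reqids={1, 2})
on_lookup_comp(record, 1, now=10.0, t_j=3.0)
after_first = record.acktime
on_lookup_comp(record, 2, now=12.0, t_j=3.0)
after_second = record.acktime
on_lookup_comp(record, 2, now=20.0, t_j=3.0)   # a duplicate does not reschedule
on_lookup_comp(record, 99, now=21.0, t_j=3.0)  # unrelated request is ignored
```

The acknowledgment time is set exactly once, when the last outstanding request completes, so late or duplicate
responses do not push it back.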

During the joining protocol, process i periodically pings its power-of-two fingers for their b-blocks. The
relevant transitions are the join-ping transitions and the receive transitions for ping messages and their
responses (see Figure 6).

Process i performs a join-ping transition while it is joining, whenever ping-time is reached. When it does so, it
sends ping messages to the c-blocks of all targets for which lookup requests have already completed. This allows
process i to augment and refresh its information about completed requests while finishing the joining protocol.
When process i receives a ping message, it responds by sending back its b-block, in a block message. When process
i receives a block message, it updates its finger table with the new information.


Internal join-ping_i
    Precondition:
        ¬failed
        status = joining
        ping-time = now
    Effect:
        for r ∈ requests where r.id ∈ join.comp do
            for f ∈ f-block(r.target, c) do
                add ((ping), f.gid.phys) to out-queue
        ping-time := now + T_g

Input receive(ping)_{j,i}
    Effect:
        if ¬failed then
            if status = active then
                block := truncate(f-block(me.log, b), now + T_e)
                add ((block, block), j) to out-queue

Input receive(block, F)_{j,i}
    Effect:
        if ¬failed then
            if status = active then
                new-fingers := {f ∈ F : f.exptime > now}
                fingers := update(fingers, new-fingers)


Figure 6: Transitions related to pinging during the join protocol 


2.4.3 Transitions Involved in Stabilization 


Once process i is active, it performs several types of transitions to maintain its finger table. The protocol
includes two kinds of stabilization: normal-case, lightweight stabilization, and a heavier-weight stabilization.

In the lightweight stabilization protocol, process i periodically sends its b-block to its nearby neighbors (the
members of its b-block), and periodically pings processes in the vicinity of its power-of-two successors, so that
they send i their current b-blocks. The transitions involved in this lightweight stabilization protocol are the
neighbor-refresh transitions, the chord-ping transitions, and the receive transitions for ping and block
messages. Note that the pseudocode for the ping and block transitions has already been presented in Figure 6,
while the pseudocode for the neighbor-refresh and chord-ping transitions appears in Figure 7(a)-(b).


(a) Internal neighbor-refresh_i
    Precondition:
        ¬failed
        status = active
        nbr-refresh-time = now
    Effect:
        for f ∈ f-block(me.log, b) do
            add ((block, f-block(me.log, b)), f.gid.phys) to out-queue
        nbr-refresh-time := now + T_g

(b) Internal chord-ping_i
    Precondition:
        ¬failed
        status = active
        ping-time = now
    Effect:
        for k ∈ {0, ..., n-1} do
            for f ∈ f-block(me.log + 2^k, c) do
                add ((ping), f.gid.phys) to out-queue
        ping-time := now + T_g

(c) Internal stabilize(x)_i
    Precondition:
        ¬failed
        status = active
    Effect:
        choose rid ∈ ReqId(i) - used-reqids
        used-reqids := used-reqids ∪ {rid}
        requests := requests ∪ {(rid, 1, x)}
        for f ∈ f-ppredset(x, c) do
            add ((lookup, rid, 1, x), f.gid.phys) to out-queue


Figure 7: Transitions related to stabilization.


The heavyweight stabilization protocol is similar to the Chord stabilization protocol (see Figure 7(c)). Process
i (for any reason, unspecified here) may try to obtain new information about any target x. Most commonly, such a
target will be one of its power-of-two successors. For example, process i might execute stabilize(x)_i for each x
of the form PtoX(i) + 2^k, at regular intervals, or when it suspects that its information is out of date.


2.4.4 Transitions Related to Client Lookups 


The transitions related to client-initiated lookup operations include the receive transitions already described, plus the 
lookup and lookup-ack transitions. These last two appear in Figure 8. 

When process i receives a client-initiated lookup request, it handles it in much the same way it handles a
request in the joining protocol. Namely, it chooses and records a request identifier, and sends a lookup message
to each of the c best proper predecessors it knows for the target identifier. An exception: if process i believes
it is one of the c best predecessors, it does not bother sending out any lookup messages, but simply records the
fact that the lookup is done. A lookup-ack can occur when a request is done but not yet acknowledged to the
client. In this case, the response includes information about process i's current c best predecessors for the
target.


Input lookup(x)_i
    Effect:
        if ¬failed then
            if status = active then
                choose rid ∈ ReqId(i) - used-reqids
                used-reqids := used-reqids ∪ {rid}
                if me.log ∈ ppredset(x, c, local-ring) then
                    client.comp := client.comp ∪ {rid}
                else
                    client.reqids := client.reqids ∪ {rid}
                    requests := requests ∪ {(rid, 1, x)}
                    for f ∈ f-ppredset(x, c) do
                        add ((lookup, rid, 1, x), f.gid.phys) to out-queue

Output lookup-ack(H)_i
    Precondition:
        ¬failed
        r ∈ requests
        r.id ∈ client.comp - client.acked
        H = {f.gid : f ∈ f-ppredset(r.target, c)}
    Effect:
        client.acked := client.acked ∪ {r.id}


Figure 8: Transitions for client lookup 


3 Summary of Analysis Results 


In this section we give a short and informal summary of our analysis results. Appendix A presents the proofs of these 
results. 

We make the following assumptions about the environment: (1) all processes are time-synchronized, (2) the mes- 
sage delay is bounded above by d, and there is no message loss, (3) during an interval of time T, + 2d, the number of 
join-ack events among processes in an “arc” of the ring containing at most b + 1 processes is at most joinbd, and (4) 
during an interval of time T,, the number of failed processes in an “arc” of the ring containing at most b + 1 processes 
is at most failbd. 

We show that if these assumptions hold and, furthermore, the following constraints are satisfied:

1. T_j > T_g + 2d and T_e > 5(T_g + 2d)

2. c > 7 joinbd + 4 failbd

3. b > 2c + 3 joinbd + max(2 joinbd, failbd)

then all lookup operations are correct. In particular we prove the following result:


Theorem 3.1. Every good execution α satisfies (2T_e + 6d)-lookup-correctness.


The notion of ε-lookup-correctness is defined as follows: suppose that a lookup-ack(H)_i event occurs in β at
time t, in response to a prior lookup(x)_i. Let β' be the portion of β ending with the given lookup-ack(H)_i
event. Then there exists a ring R such that:


1. R ⊆ aug-ring(β'),
2. global-ring(β') - {PtoX(j) : join-ack_j occurs at a time > t - ε} ⊆ R, and
3. H = ppredset(x, c, R).


Furthermore, we show that in the absence of any other events in the system the lookup latency is bounded. More 
formally, we prove the following result: 


Theorem 3.2. Suppose that α is a good execution, and α' is a finite prefix of α containing at least 2c + 1
join-ack events. Suppose that:


1. The final step of α' is a lookup_i step in which i initiates request r, with target x.
2. No other requests (on behalf of joins, client lookups, or stabilizes) are active at any time ≥ ℓtime(α') - T_e.
Then request r terminates with a receive(lookup-comp) step, at a time that is ≤ ℓtime(α') + 4(log N + 1)d.


In order to prove these results, in Appendix A we first prove a series of results asserting that the basic routing 
infrastructure is maintained correctly by the joining and refresh protocols. 

Figure 9: (a) The lookup failure rate versus the rate of change; (b) the average path length and the 90-th percent confidence 
interval as a function of the change rate. 


While the deterministic assumptions on the bounded number of joins (joinbd) and failures (failbd) allow us to 
prove strong analytical results, these assumptions are not always realistic. We consider this issue in Appendix B, 
where we give bounds on the probability that these assumptions hold in a steady-state system in which processes join 
according to a Poisson process and have a lifetime drawn from an exponential distribution. In particular, we compute 
the mean time between two violations of these assumptions as 


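[Equation (1) is not recoverable from the source text. The one legible fragment is the squared term (c/3 − λT_j(b + 1))^2, which appears to sit in the exponent of an exponential.]    (1) 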


where λ represents the normalized rate of change (i.e., the rate of change in the entire system divided by the number 
of processes N in the system), c/3 > λT_j(b + 1), and b > 13c/6. 


4 Simulation Results 


In this section we evaluate our algorithm by simulation. Our goal is twofold. First, we want to get a sense of how 
far we can push the protocol in practice before it breaks, i.e., before we start to see lookup failures. Second, we 
want to see how the protocol performs in the average case. We use the average number of stages in a lookup as the 
metric to evaluate the performance of MultiChord. 

We have developed an event-driven simulator that accurately implements the protocol at the message level. In all 
simulations, we use T_r = 10 sec, T_j = 11 sec, and T_e = 55 sec. The message propagation delay is bounded by d = 50 
ms. Note that these values satisfy the constraints presented in Section 3, i.e., T_j > T_r + 2d and T_e > 5(T_r + 2d). 
Each process schedules heavy stabilization every 60 sec. 
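
The check that these values satisfy the timing constraints is simple arithmetic; a minimal sketch is given below. The function name is hypothetical, and the assignment of 10, 11, and 55 seconds to T_r, T_j, and T_e is the one implied by the two inequalities.

```python
def timing_ok(t_r: float, t_j: float, t_e: float, d: float) -> bool:
    """Constraint 1: T_j > T_r + 2d and T_e > 5(T_r + 2d), all in seconds."""
    return t_j > t_r + 2 * d and t_e > 5 * (t_r + 2 * d)


if __name__ == "__main__":
    # Simulation values: T_r = 10 s, T_j = 11 s, T_e = 55 s, d = 50 ms.
    # T_r + 2d = 10.1 s, which is below 11 s; 5(T_r + 2d) = 50.5 s, which is below 55 s.
    print(timing_ok(t_r=10.0, t_j=11.0, t_e=55.0, d=0.050))
```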

We consider a network with 1,000 processes, in which processes join at a rate λ_a, according to a Poisson process, 
and have an exponentially distributed lifetime with mean N/λ_a; thus, the number of processes in the system 
remains roughly the same. In addition, we assume that the system receives lookups at a rate approximately 10 times 
larger than the join and failure rate, r. 
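
The claim that the population stays near N follows from Little's law (expected population = arrival rate × mean time in system). The sketch below works through the arithmetic for one hypothetical join rate; the function name and the chosen rate of 2.0 joins per second are illustrative assumptions.

```python
from typing import Tuple


def steady_state(n: int, join_rate: float, lookup_factor: float = 10.0) -> Tuple[float, float, float]:
    """Return (mean lifetime in s, expected population, lookup rate per s)."""
    mean_lifetime = n / join_rate                     # mean of the exponential lifetime
    expected_population = join_rate * mean_lifetime   # Little's law: L = lambda * W
    lookup_rate = lookup_factor * join_rate           # lookups arrive about 10x as often
    return mean_lifetime, expected_population, lookup_rate


if __name__ == "__main__":
    lifetime, population, lookups = steady_state(n=1000, join_rate=2.0)
    print(lifetime, population, lookups)
```

With these numbers a process lives 500 sec on average, the expected population is 1,000, and a lookup arrives every 50 ms on average.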

Figure 9(a) plots the lookup failure rate versus the arrival rate of new processes in the system (i.e., the join rate) 
over 10,000 lookups. During each simulation approximately 1,000 new processes join the system, and 
1,000 processes fail. We consider two cases: (i) c = 2, b = 5, and (ii) c = 4, b = 9. As expected, the rate of 
lookup failure increases as the join rate increases. However, increasing the level of redundancy (i.e., parameters b and 
c) makes a significant difference. While in case (i) we did not record any lookup failure for join rates less than or equal to 
0.1, in case (ii) we did not see any lookup failure for a join rate five times larger, i.e., 0.5. Furthermore, for a join rate 
of 2.0 the rate of lookup failure in the first case is about 18 times larger than in the second case. 

It is interesting to compare the simulation results with our lower bound on the mean time T_f between two violations 
of the deterministic constraints. Consider the first case where c = 2 and b = 5. Using Eq. (1), for a join rate of 0.5 
we obtain T_f = 14 ms.² This is a very small value given the fact that a lookup operation is generated every 50 ms 
(i.e., there are roughly 10 lookups for every join operation). One explanation for this large discrepancy is that a single 
constraint violation will hurt only a small fraction of lookups, if at all. Indeed, the lookups that do not use the region 
of the network where the constraints are violated will not be affected. 


² Here we use c = 2, b = 5, λ = 1/1000 (there is one join and one failure every 0.5 sec on average and N = 1000), and T_j = 11 sec. 


Figure 9(b) plots the average number of stages (path length) of a lookup versus the join rate for (i) c = 2, b = 5, 
and (ii) c = 4, b = 9, respectively. There are two points worth noting. First, the average path length is significantly 
smaller than in Chord; in Chord, the expected path length is (log N)/2, which in our case translates to 5 hops. This is 
because in MultiChord every process maintains a much larger set of fingers than in Chord. This increases the chance 
that a MultiChord process will know fingers closer to the target than an equivalent Chord process, which ultimately 
reduces the number of lookup stages. Second, as the join rate increases, the lookup path length decreases slightly. 
To understand this, recall that in steady state the average lifetime of a node is N/λ_a, where λ_a is the join rate. However, 
it takes a process at least T_r time to join the system. Thus a node will be inactive for at least a fraction f = T_r λ_a/N of its 
lifetime, which means that at least fN processes in the system are inactive on average. As the join rate increases, 
the fraction f of inactive nodes increases, which leads to a corresponding reduction in the number of active nodes 
in the system. A secondary reason is that as the join rate increases so does the failure rate. Since we do not report 
failed lookups, and since the failed lookups tend to have more stages, the reported path length is an underestimate. 
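
The two quantities used in this argument are easy to evaluate numerically. The sketch below is illustrative only: the function names are hypothetical, the logarithm is taken base 2 as is usual for Chord, and the minimum joining time is set to 10 sec purely as an assumed example value.

```python
import math


def chord_expected_path_length(n: int) -> float:
    """Expected Chord path length, (log2 N) / 2."""
    return math.log2(n) / 2


def inactive_fraction(t_join: float, join_rate: float, n: int) -> float:
    """Lower bound f on the fraction of its lifetime a node spends joining."""
    return t_join * join_rate / n


if __name__ == "__main__":
    n = 1000
    print(round(chord_expected_path_length(n), 2))   # about 5 hops for N = 1000
    f = inactive_fraction(t_join=10.0, join_rate=2.0, n=n)
    print(f, f * n)                                   # fraction and number of inactive nodes
```

For N = 1000 and a join rate of 2.0, this gives f = 0.02, i.e., at least about 20 processes are still joining at any time, so the reduction in active nodes is small, in line with the slight decrease in path length.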


5 Conclusions and Future Work 


In this paper we present MultiChord, a namespace management algorithm based on Chord [7]. MultiChord uses 
redundancy and lightweight mechanisms to accommodate limited changes in time and space. We analyze MultiChord 
and show that lookups are guaranteed to be successful and furthermore that the lookup latency is bounded. 

It would be interesting to analyze the behavior of the algorithm in situations that are less well-behaved than what 
we have described in this paper. In particular, we plan to consider what happens if the rate of change exceeds our 
assumed bound for some part of the execution, but at some point “stabilizes” to obey the rate bound. In such cases, 
we believe that our algorithm will eventually stabilize to a nearly-ideal state. It remains to determine if this is so and 
determine bounds on how long this might take. 


6 Appendix A: Analysis 


In this appendix we prove the results which were summarized in Section 3. 

Let β be a finite sequence of external actions of MultiChord, according to the external signature just defined. Then 
we define the global ring after β, global-ring(β), to be the set of XIds x such that a join-ack_{XToP(x)} event occurs in β 
and no fail_{XToP(x)} occurs in β. That is, the global ring after β consists of those processes that have completed joining 
the system and have not failed. We extend this same definition to finite executions of untimed or timed automata that 
have the given external signature. 

If β is a finite timed sequence of actions in the MultiChord external signature, then we define the augmented ring 
after β, aug-ring(β), to be global-ring(β) ∪ X, where X is the set of XIds x such that fail_{XToP(x)} occurs in β at a 
time > ℓtime(β) − T_e. That is, aug-ring(β) augments global-ring(β) by adding in the logical identifiers of recently 
failed processes. Again, we extend this definition to finite executions of timed automata that have the given external 
signature. 
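
These two definitions can be sketched as simple set computations over a timed event log. In the illustration below, the event encoding is a hypothetical simplification: each event is a tuple (time, kind, x) already keyed by logical identifier, so the mapping between physical and logical identifiers is glossed over, and ℓtime(β) is taken to be the time of the last event.

```python
from typing import Iterable, Set, Tuple

Event = Tuple[float, str, int]   # (time, "join-ack" or "fail", logical id x)


def global_ring(events: Iterable[Event]) -> Set[int]:
    """Identifiers that have a join-ack and no fail in the sequence."""
    events = list(events)
    joined = {x for _, kind, x in events if kind == "join-ack"}
    failed = {x for _, kind, x in events if kind == "fail"}
    return joined - failed


def aug_ring(events: Iterable[Event], t_e: float) -> Set[int]:
    """global_ring plus identifiers whose fail occurred after ltime - T_e."""
    events = list(events)
    ltime = max((t for t, _, _ in events), default=0.0)
    recently_failed = {
        x for t, kind, x in events if kind == "fail" and t > ltime - t_e
    }
    return global_ring(events) | recently_failed


if __name__ == "__main__":
    log = [(0.0, "join-ack", 1), (1.0, "join-ack", 2), (5.0, "join-ack", 3),
           (10.0, "fail", 1), (100.0, "fail", 2)]
    print(global_ring(log))
    print(aug_ring(log, t_e=55.0))
```

In this log, process 1 failed long ago and is dropped from both rings, while process 2 failed within the last T_e and therefore still appears in the augmented ring.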


6.1 Service Guarantees 

We describe safety and latency guarantees. We do not present any liveness guarantees here, replacing them with 
latency guarantees. 

6.1.1 Safety 


The following condition is simply a well-formedness condition, expressing basic conditions such as “the service 
responds only to invocations that were actually made”. 


• Well-formedness: For each i, at most one join-ack_i occurs in β. Any join-ack_i in β is preceded by a join(∗)_i. 
Any lookup-ack_i is preceded by a lookup(∗)_i with no intervening lookup-ack(∗)_i. If fail_i occurs in β, then no 
following outputs occur. 


We have not formulated any interesting safety guarantees related to joining. For client lookup, we require the 
following property, parameterized by e ∈ R^{≥0}: 


• e-Lookup-correctness: Suppose that a lookup-ack(H)_i event occurs in β at time t, in response to a prior 
lookup(x)_i. Let β' be the portion of β ending with the given lookup-ack(H)_i event. Then there exists a 
ring R such that: 


1. R ⊆ aug-ring(β'), 
2. global-ring(β') − {PToX(j) : join-ack_j occurs at a time > t − e} ⊆ R, and 
3. H = ppredset(x, c, R). 


6.1.2 Latency 
As noted above, we replace liveness claims by latency bounds: 


• e-Join-latency: Suppose that a join(J)_i event occurs in β, at time t. 


1. If J = ∅ then a corresponding join-ack_i occurs at time t. 


2. If there exists j ∈ J such that join-ack_j occurs before the join(J)_i, and neither i nor j fails in β, then a 
corresponding join-ack_i occurs by time t + e. 


• e-Lookup-latency: If a lookup(x)_i event occurs in β at time t and no fail_i occurs in β, then a corresponding 
lookup-ack(∗)_i occurs by time t + e. 

6.2 Assumptions for Analysis 


In this section we formalize the algorithm constraints and the assumptions about the environment, which we discussed 
in Section 3. 

6.2.1 Restrictions on the algorithm 

Constraints on values of the constants b, c, d, T_r, T_e, and T_j: 
• T_j > T_r + 2d 
• T_e > 5(T_r + 2d) 

Scheduling assumptions: 


• The locally controlled actions that are enabled are performed without any intervening time-passage. 


6.2.2 Restrictions on the environment 


Constants: 
For the purpose of analysis, we introduce two constants, joinbd and failbd. We assume: 


• c > 7joinbd + 4failbd 
• b > 2c + 5joinbd 
• b > 2c + 3joinbd + failbd 

Restrictions on timing and failures: 
• No message loss. 
• No time passes while a locally-controlled action is enabled. 


• Bounded local joins: An execution α satisfies bounded local joins provided that for any finite prefix α' of α, the 
following holds. 
Let x, y ∈ XId where |global-ring(α') ∩ [x, y]| ≤ b + 1. Then the number of join-ack_k events that occur in α' 
at times > ℓtime(α') − (T_r + 2d), where PToX(k) ∈ [x, y], is ≤ joinbd. 
That is, at any point in the execution α, the number of recent join-ack events among processes in an “arc” of the 
ring containing at most b + 1 processes is at most joinbd. 


This assumption is not ideal because it is expressed in terms of the number of join-ack events, which are under 
the control of the algorithm (rather than the environment). We could justify this assumption in terms of a more 
primitive assumption that bounds the rate of join events, which are controlled by the environment. To do this, 
we might need to modify the algorithm so that it schedules the join-acks so that (in the normal case) they occur 
a fixed amount of time after the joins. Alternatively, a probabilistic justification might be possible. 


• Bounded local failures: An execution α satisfies bounded local failures provided that for any finite prefix α' of 
α, the following holds. 
Let x, y ∈ XId where |global-ring(α') ∩ [x, y]| ≤ b + 1. Then the number of fail_k events that occur in α' at 
times > ℓtime(α') − T_e, where PToX(k) ∈ [x, y], is ≤ failbd. 
That is, at any point in α, the number of recent fail events among processes in an arc of the ring containing at 
most b + 1 processes is at most failbd. 
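
The two local bounds can be illustrated by a check on a single arc. The sketch below is only an illustration under several assumptions: identifiers are integers on a ring of size `modulus`, the interval [x, y] is read as the clockwise arc from x to y inclusive, and join-ack and fail times are supplied as dictionaries keyed by logical identifier. All function and parameter names are hypothetical.

```python
from typing import Dict, Iterable


def in_arc(z: int, x: int, y: int, modulus: int) -> bool:
    """True if z lies on the clockwise arc [x, y] of a ring of size modulus."""
    return (z - x) % modulus <= (y - x) % modulus


def arc_within_bounds(
    x: int, y: int, *,
    ring: Iterable[int],                 # global-ring of the prefix
    join_ack_times: Dict[int, float],    # logical id -> time of its join-ack
    fail_times: Dict[int, float],        # logical id -> time of its fail event
    now: float,                          # ltime of the prefix
    t_r: float, t_e: float, d: float,
    b: int, joinbd: int, failbd: int,
    modulus: int,
) -> bool:
    """Check bounded local joins and bounded local failures for one arc."""
    members = sum(1 for z in ring if in_arc(z, x, y, modulus))
    if members > b + 1:
        return True  # the assumptions only constrain arcs with at most b + 1 members
    recent_joins = sum(
        1 for z, t in join_ack_times.items()
        if in_arc(z, x, y, modulus) and t > now - (t_r + 2 * d)
    )
    recent_fails = sum(
        1 for z, t in fail_times.items()
        if in_arc(z, x, y, modulus) and t > now - t_e
    )
    return recent_joins <= joinbd and recent_fails <= failbd
```

An execution satisfies the assumptions only if this check passes for every arc and every finite prefix, so a full checker would iterate over all pairs (x, y) drawn from the identifiers seen so far.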


We also need a special assumption to ensure that there are “enough” processes in the ring. 
• Enough-processes: An execution α satisfies enough-processes provided that it has a finite prefix α' such that: 


1. At least 2b + 1 join-ack events occur in α'. 
2. No fail event occurs in α'. 


3. In any state of α after α', the total number of live processes is always ≥ 2b + 1. 


We call the shortest such prefix α' the initialization prefix. 


A good execution is one that observes all the timing and failure restrictions given in this section. 


6.3 Basic Lemmas 

The first lemma says that exptimes of fingers are always ≥ now. 


Lemma 6.1. The following is true in any state that is reachable in a good execution: 
If f ∈ fingers_i then f.exptime ≥ now. 


The next lemma says that every physical identifier i that appears in another process’ fingers set, or in a message 
in transit, must correspond to a process whose status is active. 


Lemma 6.2. The following is true in any state that is reachable in a good execution: 
Suppose that f ∈ Finger, f.phys = i, and any of the following holds: 


1. f ∈ fingers_j for some j ≠ i. 

2. f ∈ m.block for some m ∈ BlockMsg that is in transit. 

3. f ∈ m.preds for some m ∈ LookupResp in transit. 

4. f ∈ m.block for some m ∈ LookupComp in transit. 
Then status_i = active. 


The next lemma says that, if a process fails at a time t, then no expiration time for that process that is greater than 
t + T_e ever appears anywhere in the state. 


Lemma 6.3. Suppose that α is a finite execution, and fail_i occurs at time t in α. Suppose that f ∈ Finger and 
f.phys = i. Suppose that, in ℓstate(α), any of the following holds: 


1. f ∈ fingers_j for some j ≠ i. 

2. f ∈ m.block for some m ∈ BlockMsg that is in transit. 

3. f ∈ m.preds for some m ∈ LookupResp in transit. 

4. f ∈ m.block for some m ∈ LookupComp in transit. 
Then f.exptime ≤ t + T_e. 


As a corollary to some of the previous lemmas, the following lemma says that a process that has failed more than 
T_e time ago does not appear in anyone’s fingers set. 


Lemma 6.4. Suppose that α is a finite execution, and fail_i occurs strictly before time ℓtime(α) − T_e in α. Suppose 
that f ∈ Finger and f.phys = i. Then in ℓstate(α), f does not appear in fingers_j for any j ≠ i. 


Proof. By contradiction. Suppose that in ℓstate(α), f ∈ fingers_j for a particular j ≠ i. Then by Lemma 6.3, in 
ℓstate(α), f.exptime ≤ t + T_e, where t is the time at which fail_i occurs. Lemma 6.1 implies that, in ℓstate(α), 
f.exptime ≥ now, that is, f.exptime ≥ ℓtime(α). These two inequalities together imply that ℓtime(α) ≤ t + T_e. 
This contradicts the hypothesis that fail_i occurs strictly before time ℓtime(α) − T_e. 
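
The contradiction can be written as a single chain of inequalities. The display below reads the bounds of Lemmas 6.1 and 6.3 as non-strict, which is the weakest reading under which the argument goes through; that reading is an assumption.

```latex
\ell\mathit{time}(\alpha) \;=\; \mathit{now}
  \;\le\; f.\mathit{exptime}
  \;\le\; t + T_e
\quad\Longrightarrow\quad
t \;\ge\; \ell\mathit{time}(\alpha) - T_e ,
\qquad\text{whereas by hypothesis}\qquad
t \;<\; \ell\mathit{time}(\alpha) - T_e .
```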


6.4 Maintaining Neighbor Sets 


In this section, we prove that the neighbor sets are properly maintained. We divide the work into three steps: First, 
we consider what happens when there are no failures and only a bounded number of joins. Second, we consider what 
happens when there are still no failures but the number of joins is no longer bounded in this way. Third, we consider the 
general case, with both failures and joins. 

The results we prove express knowledge guarantees for live processes. Specifically, we show that all live processes 
always know about all neighbors that joined more than time 2T_r + 5d ago. Moreover, after a process has been live for 
sufficiently long, it knows about all neighbors that joined more than time d ago. 

Breaking the proof up in some such way seems necessary in order to make the proof tractable. Each stage introduces 
its own new difficulties: the first stage already includes many of the issues involving the timing of the flow of 
information during and soon after the joining protocol. The second stage introduces issues of local knowledge: each 
process maintains information about its local neighborhood only. The third stage introduces the complications of 
failures, which mean that a process cannot rely on responses from any particular other process. 

We expect this decomposition to be useful in constructing the general proof, because the ideas of the first stages 
should be useful in the later stages. Also, the result for the first stage should be directly usable in proving the more 
general results, in describing properties of the initial set-up phase. 


6.4.1 Basic lemmas 


The following lemma says that every block message contains a high expiration time for the sender. 
The mention of a deadline for a message in transit refers to a detailed state-machine model for a timed channel, in 
which a deadline is explicitly kept for each message. This deadline is described in terms of absolute time. 


Lemma 6.5. Let α be a good finite execution. If a block message is in transit from i with deadline ℓ then it contains a 
finger for i with exptime ≥ ℓ + T_e − d. 


6.4.2 No failures, limited joins 


In the case we consider in this subsection, no processes fail and at most 2c + 1 join-acks occur. With this limited 
number of join-acks, every process is in every other process’ c-block, so we do not have to worry about issues of local 
knowledge. 

The following lemma says that everyone “always” has a finger for i_0, with a “sufficiently high” expiration time. 
The precise statement of this is rather complicated, because many different cases are covered. 


Lemma 6.6. Let α be a good finite execution that contains no fail events, and contains at least one and at most 2c + 1 
join-ack events. Let i_0 denote the process that performs the first join-ack in α. Let i ∈ PId. 
Then in ℓstate(α): 


1. If status_i = joining and α contains a receive(lookup-comp)_{∗,i} event for target PToX(i), then there exists 
f ∈ fingers_i with f.phys = i_0 such that: 
(a) One of the following holds: 
i. f.exptime ≥ ping-time_i + 2d. 
ii. There is a ping message in out-queue_i addressed to i_0 and f.exptime ≥ now + 2d. 
iii. There is a ping message in transit from i to i_0 with deadline ℓ and f.exptime ≥ ℓ + d. 
iv. There is a block message in out-queue_{i_0} addressed to i, and f.exptime ≥ now + d. 
v. There is a block message in transit from i_0 to i with deadline ℓ, and f.exptime ≥ ℓ. 
(b) f.exptime ≥ now. 

2. If status_i = joining and α contains a receive(block)_{i_0,i} event, then there exists f ∈ fingers_i with f.phys = i_0 
such that: 
(a) One of the following holds: 
i. f.exptime ≥ ping-time_i + 2T_r + 7d. 
ii. There is a ping message in out-queue_i addressed to i_0 and f.exptime ≥ now + 2T_r + 7d. 
iii. There is a ping message in transit from i to i_0 with deadline ℓ and f.exptime ≥ ℓ + 2T_r + 6d. 
iv. There is a block message in out-queue_{i_0} addressed to i, and f.exptime ≥ now + 2T_r + 6d. 
v. There is a block message in transit from i_0 to i with deadline ℓ, and f.exptime ≥ ℓ + 2T_r + 5d. 
(b) f.exptime ≥ now + 2T_r + 5d. 

3. If status_i = live then there exists f ∈ fingers_i with f.phys = i_0 such that: 
(a) One of the following holds: 
i. f.exptime ≥ nbr-refresh-time_i + 2T_r + 5d. 
ii. There is a block message in out-queue_i addressed to i_0, and f.exptime ≥ now + 2T_r + 5d. 
iii. There is a block message in transit from i to i_0 with deadline ℓ and f.exptime ≥ ℓ + 2T_r + 4d. 
iv. There is a finger for i in fingers_{i_0} with exptime ≥ nbr-refresh-time_{i_0}, and f.exptime ≥ nbr-refresh-time_{i_0} + 
T_r + 4d. 
v. There is a block message in out-queue_{i_0} addressed to i, and f.exptime ≥ now + T_r + 4d. 
vi. There is a block message in transit from i_0 to i with deadline ℓ, and f.exptime ≥ ℓ + T_r + 3d. 
(b) f.exptime ≥ now + T_r + 3d. 

4. If a block or lookup-comp message is in an out-queue then it contains a finger for i_0 with exptime ≥ now + 
T_r + 3d. 

5. If a block or lookup-comp message is in transit with deadline ℓ then it contains a finger for i_0 with exptime ≥ 
ℓ + T_r + 2d. 


Proof. We proceed by induction on the number of steps in α following the join-ack_{i_0}. 

Base: 0 steps. 

Then the last step of α is join-ack_{i_0}. All the conditions are easy to check. 

Inductive step: The only actions that could falsify any of the claims are receive(lookup), send(lookup-comp), receive(lookup-comp), 
join-ping, send(ping)_{∗,i_0}, receive(ping)_{∗,i_0}, send(block), receive(block), join-ack, neighbor-refresh, ν, and garbage-collect. 

We consider cases. 


1. receive(lookup)_{∗,i}. 

This has the potential to falsify Property 4, in the case where a lookup-comp message is placed in out-queue_i. 
By inductive hypothesis, Property 3(b), in the pre-state of the final transition, there exists f ∈ fingers_i such that 
f.phys = i_0 and f.exptime ≥ now + T_r + 3d. Therefore, if a lookup-comp message is placed in out-queue_i 
as a result of this transition, it contains a finger for i_0 with exptime ≥ now + T_r + 3d. This shows Property 4. 


2. send(lookup-comp)_{i,∗}. 

This could falsify Property 5. In the pre-state of the final transition, a lookup-comp message is in out-queue_i. 
Therefore, by inductive hypothesis, Property 4, this message contains a finger for i_0 with exptime ≥ now + 
T_r + 3d. Since ℓ ≤ now + d, we have exptime ≥ ℓ + T_r + 2d, as needed for Property 5. 


3. receive(lookup-comp)_{∗,i}. 

This could falsify Property 1. Before the step, a lookup-comp message is in transit to i with deadline ≥ now. By 
inductive hypothesis, Property 5, this message contains a finger for i_0 with exptime ≥ now + T_r + 2d. So after 
the step, fingers_i contains a finger f for i_0 with f.exptime ≥ now + T_r + 2d. Since ping-time_i ≤ now + T_r, 
we have that f.exptime ≥ ping-time_i + 2d. This shows both parts of Property 1. 


4. join-ping_i. 

This could falsify Property 1(a) or 2(a). For Property 1(a), suppose that status_i = joining and α contains a 
receive(lookup-comp)_{∗,i} event for target PToX(i). The interesting case is where 1(a)i is true just before the 
step, that is, fingers_i contains a finger f for i_0 with f.exptime ≥ ping-time_i + 2d. Since ping-time_i ≥ now, 
this implies that f.exptime ≥ now + 2d. This inequality is true after the step as well. 

We claim that the step results in a ping message addressed to i_0 being placed in out-queue_i; this means that 
1(a)ii is satisfied in the post-state, as needed. Since we have assumed that Property 1(a)i is true in the pre-state, 
we know that fingers_i contains a finger for i_0 in the pre-state. Since a receive(lookup-comp) occurs in α for 
target PToX(i), we know that there exists r such that r.id ∈ join.comp_i and r.target = PToX(i). Therefore, 
the join-ping deposits ping messages addressed to its entire c-block, according to its local ring. This includes i_0, 
as needed. 

For Property 2(a), the argument is similar to that for Property 1(a). This time, suppose that status_i = joining 
and α contains a receive(block)_{i_0,i} event. The interesting case is where 2(a)i is true just before the step, that is, 
fingers_i contains a finger f for i_0 with f.exptime ≥ ping-time_i + 2T_r + 7d. Since ping-time_i ≥ now, this 
implies that f.exptime ≥ now + 2T_r + 7d. This inequality is true after the step as well. 

We claim that the step results in a ping message addressed to i_0 being placed in out-queue_i; this means that 
2(a)ii is satisfied in the post-state, as needed. Since we have assumed that Property 2(a)i is true in the pre-state, 
we know that fingers_i contains a finger for i_0 in the pre-state. Since a receive(lookup-comp) occurs in α for 
target PToX(i), we know that there exists r such that r.id ∈ join.comp_i and r.target = PToX(i). Therefore, 
the join-ping deposits ping messages addressed to its entire c-block, according to its local ring. This includes i_0, 
as needed. 


5. send(ping)_{i,i_0}. 

This could falsify Property 1(a) or 2(a). For Property 1(a), suppose that status_i = joining and α contains a 
receive(lookup-comp)_{∗,i} event for target PToX(i). The interesting case is where 1(a)ii is true just before the 
step, that is, fingers_i contains a finger f for i_0 with f.exptime ≥ now + 2d and there is a ping message in 
out-queue_i addressed to i_0. After the step, there is a ping message in transit from i to i_0 with deadline now + d. 
Taking ℓ = now + d, we see that 1(a)iii is true after the step. 

For Property 2(a), the argument is similar: 2(a)ii before the step implies 2(a)iii after the step. 


6. receive(ping)_{i,i_0}. 

This could falsify Property 1(a) or 2(a). For Property 1(a), suppose that status_i = joining and α contains a 
receive(lookup-comp)_{∗,i} event for target PToX(i). The interesting case is where 1(a)iii is true just before the 
step, that is, fingers_i contains a finger f for i_0 with f.exptime ≥ ℓ + d and there is a ping message in transit 
from i to i_0 with deadline ℓ. Since ℓ ≥ now, we have that f.exptime ≥ now + d. After the step, there is a 
block message in out-queue_{i_0} addressed to i. Therefore, 1(a)iv is true just after the step. 

For Property 2(a), the argument is similar: 2(a)iii before the step implies 2(a)iv after the step. 


7. send(block)_{j,k}. 

This could falsify Property 1(a), 2(a), 3(a), or 5. For Property 1(a), the interesting case is where j = i_0, k = i, 
and 1(a)iv is true before the step, that is, fingers_i contains a finger f for i_0 with f.exptime ≥ now + d. Since 
now + d ≥ ℓ, we have that f.exptime ≥ ℓ, so that 1(a)v holds after the step. 

For Property 2(a), the interesting case is where j = i_0, k = i, and 2(a)iv holds before the step. Then, arguing as 
in the previous case, 2(a)v holds after the step. 

For Property 3(a), there are two interesting cases. The first is where j = i, k = i_0, and 3(a)ii holds before the 
step; in this case 3(a)iii holds after the step. The second case is where j = i_0, k = i, and 3(a)v holds before the 
step; in this case 3(a)vi holds after the step. 

For Property 5, we use Property 4 in the pre-state to show Property 5 in the post-state. 


8. receive(block)_{j,k}. 

This could falsify Property 1(a), 2(a), or 3(a). 

For Property 1(a), the interesting case is where j = i_0, k = i, and Property 1(a)v holds before the step. Then 
by Lemma 6.5, the received message contains a finger for i_0 with exptime ≥ now + T_e − d. By assumptions 
on the constants, the right-hand side is ≥ now + 4T_r + 7d, so exptime ≥ now + 4T_r + 7d. Therefore, after the step, 
fingers_i contains a finger for i_0 with exptime ≥ now + 4T_r + 7d ≥ ping-time_i + 2d. Thus, 1(a)i is satisfied 
after the step. 

For Property 2(a), the interesting case is where j = i_0, k = i, and Property 2(a)v holds before the step. Arguing 
as in the previous case, we see that after the step, fingers_i contains a finger for i_0 with exptime ≥ now + 4T_r + 7d ≥ 
ping-time_i + 2T_r + 7d. Thus, 2(a)i is satisfied after the step. 


For Property 3(a), there are two interesting cases. The first is where j = i, k = i_0, and 3(a)iii is satisfied before 
the step; then we claim that 3(a)iv holds after the step. The argument for this uses Lemma 6.5, applied to i. The 
second case is where j = i_0, k = i, and 3(a)vi is satisfied before the step; in this case, 3(a)i holds after the step. 


9. join-ack_i. 

This could falsify Property 3(a). By inductive hypothesis, Property 2(b), in the pre-state, fingers_i contains a 
finger for i_0 with exptime ≥ now + 2T_r + 5d. Since nbr-refresh-time_i = now right after the step, 3(a)i holds 
after the step. 


10. neighbor-refresh_i. 

This could falsify Property 3(a). The interesting case is where Property 3(a)i holds in the pre-state. The step 
puts a block message in out-queue_i addressed to i_0. Then 3(a)ii holds in the post-state. 


11. ν(t). 

This could falsify Property 1, 2, 3, or 4. For Property 1, there are two interesting cases. The first is where 1(a)iv 
holds in the pre-state. But then time cannot pass, by our timing assumption (no time passes while an out-queue 
is nonempty). The second possibility is that we might falsify 1(b). However, note that 1(b) follows from 1(a). 
Similar arguments hold for Properties 2, 3, and 4. 


12. garbage-collect. 

Since in every case, the finger whose existence is claimed has exptime ≥ now, it cannot be garbage-collected. 
Therefore, garbage-collect cannot falsify any of the claims. 
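
The constant arithmetic behind the receive(block) case can be spelled out as follows. This is only a sketch of the omitted step; it assumes the constraint T_e > 5(T_r + 2d) from Section 6.2.1 and the bound ping-time_i ≤ now + T_r, both read as stated in the surrounding text.

```latex
\begin{align*}
T_e - d &> 5(T_r + 2d) - d = 5T_r + 9d \;\ge\; 4T_r + 7d, \\
\mathit{now} + 4T_r + 7d
  &\ge (\mathit{ping\text{-}time}_i - T_r) + 4T_r + 7d
   = \mathit{ping\text{-}time}_i + 3T_r + 7d \\
  &\ge \mathit{ping\text{-}time}_i + 2T_r + 7d
   \;\ge\; \mathit{ping\text{-}time}_i + 2d .
\end{align*}
```

The first line gives the bound needed after applying Lemma 6.5, and the second shows that the same finger satisfies both 1(a)i and 2(a)i.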


Next, we describe knowledge that i_0 acquires about the other processes. 


Lemma 6.7. Let α be a good finite execution that contains no fail events, and contains at least one and at most 2c + 1 
join-ack events. Let i_0 denote the process that performs the first join-ack in α. Let i ∈ PId be such that join-ack_i 
occurs in α at time t. 

Then in ℓstate(α), one of the following holds: 


1. t = now and a block message addressed to i_0 is in out-queue_i. 
2. A block message is in transit from i to i_0 with deadline t + d. 
3. fingers_{i_0} contains a finger f for i such that one of the following holds: 
(a) f.exptime ≥ nbr-refresh-time_i + T_e − T_r. 
(b) A block message addressed to i_0 is in out-queue_i and f.exptime ≥ now + T_e − T_r. 
(c) A block message is in transit from i to i_0 with deadline ℓ and f.exptime ≥ ℓ + T_e − (T_r + d). 
Proof. By induction on the number of steps in α following the join-ack_i. 

Base: 0 steps. 

Then the last step of α is join-ack_i. Then we claim that Property 1 holds in the post-state. This follows because in the 
pre-state, i has a finger for i_0, by Lemma 6.6, part 3(b). 

Inductive step: The only actions that could falsify the claim are send(block)_i, receive(block)_{i_0}, neighbor-refresh_i, 
time-passage, and garbage-collect_{i_0}. 

1. send(block)_i. 

This could falsify Property 1 or 3(b). However, if it does so, it makes Property 2 or 3(c) (respectively) true. 


2. receive(block)_{i_0}. 

Lemma 6.5 implies that after the step, fingers_{i_0} contains a finger for i with exptime ≥ ℓ + T_e − d, where ℓ 
is the deadline component of the received message. Since ℓ = nbr-refresh-time_i − T_r + d (the sending time 
plus d), this implies that this finger has exptime ≥ nbr-refresh-time_i − T_r + d + T_e − d, that is, exptime ≥ 
nbr-refresh-time_i + T_e − T_r, which shows that 3(a) is satisfied after the step. 


3. neighbor-refresh_i. 

This could falsify Property 3(a); however, if it does so then Property 3(b) holds after the step. 


4. ν(t). 

This could falsify Property 1 or 3(b). However, if 1 or 3(b) holds in the pre-state, then time cannot pass, by our 
timing assumptions, because an out-queue is nonempty. 


5. garbage-collect_{i_0}. 

Because T_e > T_r + d, the expiration times of the claimed fingers are all strictly greater than now. Therefore, this 
cannot falsify any of the statements. 


The following corollary summarizes the conclusions of Lemma 6.7, saying that i_0 has a finger for any other process i 
that has joined at least time d ago, with a high expiration time. Also, any block message that is sent by i_0 sufficiently 
long after i joins contains a finger for i with a high expiration time. 


Corollary 6.8. Let α be a good finite execution that contains no fail events, and contains at least one and at most 
2c + 1 join-ack events. Let i_0 denote the process that performs the first join-ack in α. Let i ∈ PId be such that 
join-ack_i occurs in α at time t. 

Then in ℓstate(α), the following hold: 


1. If t + d ≤ now then fingers_{i_0} contains a finger f for i such that f.exptime ≥ now + T_e − (T_r + d). 


2. If t + 2d ≤ ℓ and a block message is in transit from i_0 with deadline ℓ, then the message contains a finger f for i 
such that f.exptime ≥ ℓ + T_e − (T_r + 2d). 


The next lemma gives guarantees about what an arbitrary process 2 knows about another arbitrary process 7. This 
represents “second-order” information, because 2 may need to learn this information indirectly, through 7. 


Lemma 6.9. Let α be a good finite execution that contains no fail events, and contains at least one and at most 2c + 1
join-ack events. Let i_0 denote the process that performs the first join-ack in α. Let s = ℓstate(α). Then:

1. Suppose that s.status_i = joining and α contains a receive(block)_{i_0,i} event. Suppose that join-ack_j occurs in α
at a time < ℓtime(α) − (T_r + 3d).
Then s.fingers_i contains a finger f for j such that f.exptime > s.now + T_e − (2T_r + 3d).

2. Suppose that s.status_i = active and join-ack_i occurs in α at a time > ℓtime(α) − (T_r + 2d). Suppose that
join-ack_j occurs in α at a time < ℓtime(α) − (2T_r + 5d). Then s.fingers_i contains a finger f for j such that
f.exptime > s.now + T_e − (3T_r + 5d).

3. Suppose that s.status_i = active and join-ack_i occurs in α at a time < ℓtime(α) − (T_r + 2d). Suppose that
join-ack_j occurs in α at a time < ℓtime(α) − (T_r + 2d). Then s.fingers_i contains a finger f for j such that
f.exptime > s.now + T_e − (2T_r + 2d).


The proofs are based on conveying information through i_0. These proofs are not inductive; rather, they rest directly on
previously-proved lemmas.


Proof. 1. Assume that s.status_i = joining and α contains a receive(block)_{i_0,i} event. Also suppose that join-ack_j
occurs in α at a time < ℓtime(α) − (T_r + 3d).

Lemma 6.6, Part 1(b), implies that whenever i sends a ping message during its joining protocol, it has a finger
for i_0. Thus, by the limitation on the number of join-ack events, i_0 is included in the set of destinations of the
ping message.

We claim that, in α, process i receives a block message from i_0 sent by i_0 in response to a ping message sent by
i at a time > s.now − (T_r + 2d). For if not, then the latest block message received by i from i_0 is a response to
a ping sent by i at a time < s.now − (T_r + 2d). But then it must be that another ping message is sent by i at a
time < s.now − 2d, and this receives a response by the end of α, a contradiction.

Since the time of the join-ack_j is < s.now − (T_r + 3d), it must be < s′.now − d, where s′ is the state just before
i_0 sends this block message. Therefore, by Corollary 6.8, Part 1, in state s′, fingers_{i_0} contains a finger for j
with exptime > s′.now + T_e − (T_r + d). Therefore, in state s, which is at most time T_r + 2d later, fingers_i
contains a finger for j with exptime > s.now + T_e − (2T_r + 3d), as needed.

2. Suppose that s.status_i = active and join-ack_i occurs in α at a time > ℓtime(α) − (T_r + 2d). Also suppose that
join-ack_j occurs in α at a time < ℓtime(α) − (2T_r + 5d).

By Part 1, we know that, just before the join-ack_i, fingers_i contains a finger for j with
exptime > now + T_e − (2T_r + 3d). Therefore, in state s, which is at most time T_r + 2d later, fingers_i contains
a finger for j with exptime > s.now + T_e − (3T_r + 5d), as needed.


3. Suppose that s.status_i = active and join-ack_i occurs in α at a time < ℓtime(α) − (T_r + 2d). Suppose that
join-ack_j occurs in α at a time < ℓtime(α) − (T_r + 2d).

Corollary 6.8, Part 1, implies that in any state s′ of α with s′.now > t′ + d, fingers_{i_0} contains a finger for i
with exptime > s′.now + T_e − (T_r + d). Since T_e > T_r + d and because of the limitation on the number of
join-ack events, i is included in the set of destinations of every block message sent by i_0 in such a state s′.

We claim that, in α, process i receives a block message from i_0 sent by i_0 at a time > s.now − (T_r + d). For
if not, then the latest block message received by i from i_0 is sent by i_0 at a time < s.now − (T_r + d). But then
it must be that another block message is sent by i_0 (as part of a neighbor-refresh_{i_0}) at a time < s.now − d, and
this arrives at i by the end of α, a contradiction.

Now fix s′ to be the state just before i_0 sends this block message; thus, s′.now > s.now − (T_r + d). Putting
this inequality together with the assumption that the join-ack_j occurs at a time < s.now − (T_r + 2d), we may
conclude that the join-ack_j occurs at a time < s′.now − d. Therefore, by Corollary 6.8, Part 1, in state s′,
fingers_{i_0} contains a finger for j with exptime > s′.now + T_e − (T_r + d). Therefore, in state s, which is at most
time T_r + d later, fingers_i contains a finger for j with exptime > s.now + T_e − (2T_r + 2d), as needed.
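
As a reading aid, the slack terms in the three parts of Lemma 6.9 can be recovered by adding the age bound of the
finger at its source to the time that may elapse before the end of the execution. The block below is only a
bookkeeping sketch of the bounds stated in the proof above. It writes T_e and T_r for the two timing constants, and the
names \Delta_1, \Delta_2, \Delta_3 are introduced here purely for illustration; they do not appear in the paper.

```latex
% Slack accounting for Lemma 6.9: in each part, exptime > s.now + T_e - \Delta.
\begin{align*}
\Delta_1 &= \underbrace{(T_r + d)}_{\text{Corollary 6.8, Part 1}}
          + \underbrace{(T_r + 2d)}_{\text{elapsed since the block was sent}}
          = 2T_r + 3d, \\
\Delta_2 &= \underbrace{\Delta_1}_{\text{Part 1, just before join-ack}_i}
          + \underbrace{(T_r + 2d)}_{\text{elapsed since join-ack}_i}
          = 3T_r + 5d, \\
\Delta_3 &= \underbrace{(T_r + d)}_{\text{Corollary 6.8, Part 1}}
          + \underbrace{(T_r + d)}_{\text{elapsed since the block was sent}}
          = 2T_r + 2d.
\end{align*}
```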


The following lemma describes information that i is guaranteed to have after receiving a lookup-comp message. It
represents “third-order” information, because the lookup-comp message could be conveying “second-order” information
from its sender.


Lemma 6.10. Let α be a good finite execution that contains no fail events, and contains at least one and at most 2c + 1
join-ack events. Let i, j ∈ PId. Suppose that status_i = joining and α contains a receive(lookup-comp)_{*,i} event for
target PToX(i). Suppose that join-ack_j occurs in α at a time < ℓtime(α) − (3T_r + 8d).

Then in ℓstate(α), fingers_i contains a finger for j with exptime > now.


Proof. (Sketch:) If the time when process i receives the lookup-comp message is < ℓtime(α) − (T_r + 2d), then i
also receives a block message from i_0 before the end of α. In this case the result follows from Lemma 6.9, Part 1.

On the other hand, if the time when process i receives the lookup-comp message is > ℓtime(α) − (T_r + 2d), then
the result follows from Lemma 6.9, Part 2, applied to the sender of the message. In applying this lemma, we add time
T_r + 3d (d for the message delay and T_r + 2d for the time that might have elapsed from the receive(lookup-comp))
to the age of the known processes and subtract this from the expiration time of the finger. This uses the fact that
T_e > 4T_r + 9d.
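
The arithmetic behind the second case of this sketch can be spelled out as follows. This is an illustrative
reconstruction from the bounds quoted above (Lemma 6.9, Part 2, plus the extra T_r + 3d), not a replacement for the
full proof; the symbol \Delta is introduced here for convenience only.

```latex
% Second case of Lemma 6.10: the sender's finger for j, aged by the extra T_r + 3d.
\begin{align*}
\Delta &= \underbrace{(3T_r + 5d)}_{\text{Lemma 6.9, Part 2, at the sender}}
        + \underbrace{d}_{\text{message delay}}
        + \underbrace{(T_r + 2d)}_{\text{elapsed since receive(lookup-comp)}}
        = 4T_r + 8d, \\
\text{exptime} &> \text{now} + T_e - \Delta
        > \text{now} + (4T_r + 9d) - (4T_r + 8d)
        > \text{now}.
\end{align*}
```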


The next series of results bounds how long it takes for a process i to become an “authority”, like i_0; that is, i knows
about all processes that have joined more than time d ago. The first case is where another process j joins sufficiently
long after i that j knows about i at the point where it joins.


Lemma 6.11. Let α be a good finite execution that contains no fail events, and contains at least one and at most
2c + 1 join-ack events. Suppose that join-ack_i and join-ack_j occur in α at times t and t′, respectively, where
t + T_r + 3d < t′ < now − d.

Then in ℓstate(α), fingers_i contains a finger for j with exptime > now + T_e − (T_r + d).


Proof. We first claim that, at any point in α after the join-ack_j, fingers_j contains a finger for i with exptime > now.
Lemma 6.9, Part 1, implies that, in the state immediately before the join-ack_j, fingers_j contains a finger for i with
exptime > now + T_e − (2T_r + 4d). Thereafter in α, through time t′ + T_r + 2d, fingers_j contains a finger for i with
exptime > now + T_e − (3T_r + 6d). Also, at any time after t′ + T_r + 2d in α, Lemma 6.9, Part 2 implies that fingers_j
contains a finger for i with exptime > now + T_e − (3T_r + 6d). Combining these two facts, we conclude that, at any
time after the join-ack_j, fingers_j contains a finger for i with exptime > now + T_e − (3T_r + 6d) > now.

Immediately after the join-ack_j, and at intervals of T_r thereafter, process j performs a neighbor-refresh_j, in which
it sends a block message containing a finger for itself with exptime = now + T_e. By the argument in the previous
paragraph, i is included in the destination set of each such block message. At the end of α, some such message must
have arrived at i which was sent by j at a time > ℓtime(α) − (T_r + d). Therefore, in ℓstate(α), fingers_i contains a
finger for j with exptime > now + T_e − (T_r + d), as needed.


The second case is where i and j both join long enough before the end of the execution.


Lemma 6.12. Let α be a good finite execution that contains no fail events, and contains at least one and at most
2c + 1 join-ack events. Suppose that join-ack_i and join-ack_j occur in α at times t and t′, respectively, where
t, t′ < ℓtime(α) − (2T_r + 3d).

Then in ℓstate(α), fingers_i contains a finger for j with exptime > now + T_e − (T_r + d).


Proof. By Corollary 6.8, Part 1, by time strictly less than ℓtime(α) − (2T_r + 2d), fingers_{i_0} contains fingers for
both i and j, each with exptime > now + T_e − (T_r + 2d). Then by time strictly less than ℓtime(α) − (T_r + d),
j receives a block message from i_0 telling j about i, resulting in fingers_j containing a finger for i, with exptime >
now + T_e − (2T_r + 3d). And then by time strictly less than ℓtime(α), i receives a block message directly from j
telling i about j, and producing the needed finger.


The following corollary says that if process i has joined more than time 3T_r + 6d ago, it is an “authority”, in the sense
that it knows about all processes j that have joined more than time d ago.


Corollary 6.13. Let α be a good finite execution that contains no fail events, and contains at least one and at most
2c + 1 join-ack events. Suppose that join-ack_i and join-ack_j occur in α at times t and t′, respectively, where
t < ℓtime(α) − (3T_r + 6d) and t′ < ℓtime(α) − d.

Then in ℓstate(α), fingers_i contains a finger for j with exptime > now + T_e − (T_r + d).


Proof. This follows from the two previous lemmas. 


6.4.3 Joins and failures 


Now we use the ideas in the previous section to talk about what happens when we have unlimited joins and also
failures. Now, instead of relying on i_0 as an “authority”, processes rely on neighbors that happen to have been around
long enough. Because of the failures, we now consider the augmented ring as well as the actual global ring.

From now on, we are slightly sloppy and write just i instead of PToX(i) in many places, for the sake of
readability. The first lemma relates various neighborhoods in the same ring.


Lemma 6.14. Let R be any ring, i, j, k ∈ PId.

1. If j ∈ block(i, e_1, R) and k ∈ block(i, e_2, R), then j ∈ block(k, e_1 + e_2, R).

2. If j ∈ succset(i, e_1, R), k ∈ succset(i, e_2, R), and k ∉ succset(i, e_3, R), then j ∈ block(k, max(e_1 − e_3, e_2), R).


Proof. Straightforward. 
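
Part 1 of the lemma is essentially a triangle inequality on ring positions, and it can be sanity-checked by brute
force on a small ring. The sketch below is only an illustration: the definitions of psuccset, ppredset and block are
not restated in this section, so the code assumes that psuccset(i, e, R) is the set of the first e members of R strictly
clockwise of i, that ppredset is the counterclockwise analogue, and that block(i, e, R) is their union together with i.
It also assumes i, j, k are all members of R. The function name lemma_6_14_part1_holds is hypothetical.

```python
from itertools import product


def psuccset(i, e, ring):
    """Assumed definition: the first e ring members strictly clockwise of i."""
    r = sorted(ring)
    n = len(r)
    pos = r.index(i)
    steps = min(e, n - 1)  # never wrap far enough to reach i itself
    return {r[(pos + s) % n] for s in range(1, steps + 1)}


def ppredset(i, e, ring):
    """Assumed definition: the first e ring members strictly counterclockwise of i."""
    r = sorted(ring)
    n = len(r)
    pos = r.index(i)
    steps = min(e, n - 1)
    return {r[(pos - s) % n] for s in range(1, steps + 1)}


def block(i, e, ring):
    """Assumed definition: i together with its e successors and e predecessors."""
    return psuccset(i, e, ring) | ppredset(i, e, ring) | {i}


def lemma_6_14_part1_holds(ring, max_e):
    """Exhaustively check Part 1 for all members i, j, k and all e1, e2 <= max_e."""
    for i, j, k in product(ring, repeat=3):
        for e1, e2 in product(range(max_e + 1), repeat=2):
            if j in block(i, e1, ring) and k in block(i, e2, ring):
                if j not in block(k, e1 + e2, ring):
                    return False
    return True


if __name__ == "__main__":
    small_ring = {0, 10, 20, 30, 40, 50, 60}
    print(lemma_6_14_part1_holds(small_ring, max_e=4))
```

An exhaustive check of this kind says nothing about rings in general; it is useful mainly for catching a wrong
reading of the neighborhood definitions.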


The following lemma asserts the existence of neighbors that have joined a long time ago. 


Lemma 6.15. Assume that c > e_1 + e_2 + e_3 · joinbd. Let α be a good finite execution, R = global-ring(α). Let
i ∈ PId. Suppose that |R| > c + 1. Then:

1. There exists k ∈ PId such that

(a) k ∈ succset(i, c − e_1, R).
(b) k ∉ succset(i, e_2, R).
(c) join-ack_k occurs at a time < ℓtime(α) − e_3(T_r + 2d).
(d) fail_k does not occur in α.

2. There exists k ∈ PId such that

(a) k ∈ predset(i, c − e_1, R).
(b) k ∉ predset(i, e_2, R).
(c) join-ack_k occurs at a time < ℓtime(α) − e_3(T_r + 2d).
(d) fail_k does not occur in α.


Proof. We prove Part 1; Part 2 is analogous. There are at least c − (e_1 + e_2) processes in the set difference
succset(i, c − e_1, R) − succset(i, e_2, R). Of these, at most e_3 · joinbd perform a join-ack at times > ℓtime(α) −
e_3(T_r + 2d). Since c > e_1 + e_2 + e_3 · joinbd, it must be that at least one of these processes, call it k, performs a
join-ack at a time < ℓtime(α) − e_3(T_r + 2d). This k satisfies all the listed properties.
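
The counting argument in this proof is a pigeonhole bound, and it can be helpful to plug in concrete constants. The
sketch below is a minimal illustration under the counts quoted in the proof (at least c − (e_1 + e_2) candidates, at
most e_3 · joinbd of them recent). The function names and the sample values of c, joinbd and failbd are hypothetical and
are not taken from the paper.

```python
def old_neighbor_lower_bound(c, e1, e2, e3, joinbd):
    """Lower bound on the number of sufficiently old processes in
    succset(i, c - e1, R) - succset(i, e2, R), following the proof's counts."""
    candidates = c - (e1 + e2)      # processes in the set difference (at least)
    recent = e3 * joinbd            # processes that joined too recently (at most)
    return candidates - recent


def lemma_6_15_applies(c, e1, e2, e3, joinbd):
    """The lemma's hypothesis on constants: c > e1 + e2 + e3 * joinbd."""
    return c > e1 + e2 + e3 * joinbd


if __name__ == "__main__":
    # Hypothetical constants: failbd = 1, joinbd = 2, and the instantiation
    # e1 = e2 = 2 * failbd, e3 = 5, which needs c > 5 * joinbd + 4 * failbd.
    failbd, joinbd = 1, 2
    e1 = e2 = 2 * failbd
    e3 = 5
    for c in (14, 15, 20):
        print(c, lemma_6_15_applies(c, e1, e2, e3, joinbd),
              old_neighbor_lower_bound(c, e1, e2, e3, joinbd))
```

Whenever the hypothesis on c holds, the lower bound is at least 1, which is exactly the existence claim of the lemma.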


The next lemma relates neighborhoods in the global ring to neighborhoods in the augmented ring. 
Lemma 6.16. Let α be a good finite execution, e ∈ N, i, j ∈ PId.
1. If j ∈ psuccset(i, e, global-ring(α)) then j ∈ psuccset(i, e + failbd, aug-ring(α)).
2. If j ∈ ppredset(i, e, global-ring(α)) then j ∈ ppredset(i, e + failbd, aug-ring(α)).
3. If j ∈ succset(i, e, global-ring(α)) then j ∈ succset(i, e + failbd, aug-ring(α)).
4. If j ∈ predset(i, e, global-ring(α)) then j ∈ predset(i, e + failbd, aug-ring(α)).
5. If j ∈ block(i, e, global-ring(α)) then j ∈ block(i, e + failbd, aug-ring(α)).


Proof. (Sketch) These follow because at most failbd processes in the given region appear in aug-ring(α) but not in
global-ring(α).


The next lemma says that neighbors in the augmented ring are also neighbors in the local ring. 

Lemma 6.17. Let α be a good finite execution, s = ℓstate(α).
1. If j ∈ succset(i, e, aug-ring(α)) and fingers_i contains a finger for j, then j ∈ succset(i, e, s.local-ring_i).
2. If j ∈ predset(i, e, aug-ring(α)) and fingers_i contains a finger for j, then j ∈ predset(i, e, s.local-ring_i).
3. If j ∈ block(i, e, aug-ring(α)) and fingers_i contains a finger for j, then j ∈ block(i, e, s.local-ring_i).


Proof. We show Part 1; the rest are similar. If j ∉ succset(i, e, s.local-ring_i), then it must be that there are at least
e elements of s.local-ring_i in the interval (i, j). But each of these is an element of aug-ring(α), which contradicts
the assumption that j ∈ succset(i, e, aug-ring(α)).


The next lemma relates the augmented ring at some point to the global ring at a point not too far in the past. 


Lemma 6.18. Let α be a good finite execution, α′ a prefix of α with ℓtime(α′) > ℓtime(α) − T_e. If i ∈ global-ring(α′),
then i ∈ aug-ring(α).


Proof. By the definition of aug-ring. 


The next lemma says that a neighbor in the augmented ring at a particular time is a neighbor in the global ring at a 
point not too far in the past. 


Lemma 6.19. Let α be a good finite execution, α′ a prefix of α with ℓtime(α′) > ℓtime(α) − T_e. Let e ∈ N and
i, j ∈ PId. Suppose j ∈ global-ring(α′). Then:

1. If j ∈ psuccset(i, e, aug-ring(α)) then j ∈ psuccset(i, e, global-ring(α′)).
2. If j ∈ ppredset(i, e, aug-ring(α)) then j ∈ ppredset(i, e, global-ring(α′)).
3. If j ∈ succset(i, e, aug-ring(α)) then j ∈ succset(i, e, global-ring(α′)).
4. If j ∈ predset(i, e, aug-ring(α)) then j ∈ predset(i, e, global-ring(α′)).
5. If j ∈ block(i, e, aug-ring(α)) then j ∈ block(i, e, global-ring(α′)).


Proof. For Part 1, suppose for the sake of contradiction that j ∉ psuccset(i, e, global-ring(α′)). Then
|global-ring(α′) ∩ (i, j)| > e, that is, there are more than e elements of global-ring(α′) in the interval properly
between i and j, moving in the clockwise direction. By Lemma 6.18, every such element is also in aug-ring(α).
Therefore, j ∉ psuccset(i, e, aug-ring(α)). This is a contradiction.

The proof of Part 2 is analogous. For Part 3, suppose that j ∈ succset(i, e, aug-ring(α)). If j ∈ psuccset(i, e, aug-ring(α))
then the conclusion follows from Part 1. The only remaining case is where j = i, but this case follows trivially from
the fact that j ∈ global-ring(α′).

Part 4 is analogous. Part 5 follows from Parts 3 and 4. 


The following lemma summarizes facts about the knowledge of a new process at various points during and soon after 
its joining protocol. 


Lemma 6.20. Let α be a good finite execution, s = ℓstate(α). Let i be a process that does not fail in α. Then:

1. Suppose that s.status_i = joining and a receive(lookup-comp)_{*,i} event for target i occurs in α at a time >
ℓtime(α) − (T_r + 2d). Suppose that j ∈ block(i, c, aug-ring(α)), join-ack_j occurs in α at a time < ℓtime(α) −
(3T_r + 8d), and fail_j does not occur in α.
Then fingers_i contains a finger for j with exptime > now.

2. Suppose that status_i = joining and a receive(lookup-comp)_{*,i} event for target i occurs in α at a time <
ℓtime(α) − (T_r + 2d). Suppose that j ∈ block(i, b, aug-ring(α)), join-ack_j occurs in α at a time < ℓtime(α) −
(T_r + 3d), and fail_j does not occur in α.
Then fingers_i contains a finger for j with exptime > now + T_e − (2T_r + 3d).

3. Suppose that s.status_i = active and a join-ack_i occurs in α at a time > ℓtime(α) − (T_r + 2d). Suppose that
j ∈ block(i, b, aug-ring(α)), join-ack_j occurs in α at a time < ℓtime(α) − (2T_r + 5d), and fail_j does not occur
in α.
Then s.fingers_i contains a finger for j with exptime > s.now + T_e − (3T_r + 6d).

4. Suppose that s.status_i = active and a join-ack_i occurs in α at a time < ℓtime(α) − (T_r + 2d). Suppose that
j ∈ block(i, b, aug-ring(α)), join-ack_j occurs in α at a time < ℓtime(α) − (T_r + 2d), and fail_j does not occur
in α.
Then s.fingers_i contains a finger for j with exptime > s.now + T_e − (2T_r + 2d).

5. Suppose that s.status_i = active and a join-ack_i occurs in α at a time < ℓtime(α) − (3T_r + 6d). Suppose that
j ∈ block(i, b − failbd, aug-ring(α)), join-ack_j occurs in α at a time < ℓtime(α) − d, and fail_j does not occur
in α.
Then s.fingers_i contains a finger for j with exptime > s.now + T_e − (T_r + d).


Proof. Let R denote global-ring(α). The proof is by strong induction on the number of steps in α.

Base: The total number of join-ack events in α is at most 2c + 1.

If there are no join-ack events in α then the statements are all vacuously true. If there are between one and 2c + 1
join-ack events in α then the five claims follow from Lemma 6.10, Lemma 6.9, Parts 1, 2, and 3, and Corollary 6.13,
respectively. (This uses the fact that, in the absence of failures, aug-ring is the same as global-ring.)

Inductive step: We assume that α contains more than 2c + 1 join-ack events. We assume that the result is true for all
proper prefixes of α and show it for α. We show the five properties in turn.


1. For Part 1, suppose that s.status_i = joining, a receive(lookup-comp)_{*,i} event for target i occurs in α at a time
> ℓtime(α) − (T_r + 2d), and fail_j does not occur in α. Also suppose that j ∈ block(i, c, aug-ring(α)) and
join-ack_j occurs at a time < ℓtime(α) − (3T_r + 8d). We must show that s.fingers_i contains a finger for j with
exptime > s.now.

Consider the first lookup-comp message for target i that is received by i, and let k be the sender of this message.
Let α′ be the prefix of α ending just before the receive(response) step in which k sends this message, let
s′ = ℓstate(α′) and let R′ = global-ring(α′).

By inductive hypothesis, Parts 3 and 4, s′.fingers_k contains a finger for every process in succset(k, b, aug-ring(α′))
whose join-ack event occurs at a time < ℓtime(α′) − (2T_r + 5d) and that does not fail in α′. Therefore, by
Lemma 6.16, s′.fingers_k contains a finger for every process in succset(k, b − failbd, R′) whose join-ack event
occurs at a time < ℓtime(α′) − (2T_r + 5d). In particular, s′.fingers_k contains a finger for every process in
succset(k, min(|R′ ∩ [k, i)|, b − failbd), R′) whose join-ack event occurs at a time < ℓtime(α′) − (2T_r + 5d).
By our assumption on the join rate, at most 3joinbd processes in succset(k, b − failbd, R′) perform join-ack
events at times > ℓtime(α′) − (2T_r + 5d). It follows that s′.fingers_k contains at least
min(|R′ ∩ [k, i)|, b − failbd) − 3joinbd fingers for processes in R′ ∩ [k, i).

Now we claim that k ∈ ppredset(i, c + 3joinbd, R′). If not, then |R′ ∩ [k, i)| > c + 3joinbd. Then,
since b > c + 3joinbd + failbd, we have that min(|R′ ∩ [k, i)|, b − failbd) − 3joinbd > c, which implies
that s′.fingers_k contains strictly more than c fingers for processes in R′ ∩ [k, i). This implies that
k ∉ ppredset(i, c, s′.local-ring_k). However, the definition of the receive(response) transitions implies that
k ∈ ppredset(i, c, s′.local-ring_k), which yields a contradiction. Therefore, k ∈ ppredset(i, c + 3joinbd, R′), as
claimed.

Since j ∈ block(i, c, aug-ring(α)), Lemma 6.19 implies that j ∈ block(i, c, R′). Since j ∈ block(i, c, R′) and
k ∈ predset(i, c + 3joinbd, R′), Lemma 6.14 implies that j ∈ block(k, 2c + 3joinbd, R′). Since (by assumption
on constants) b > 2c + 3joinbd + failbd, we have that j ∈ block(k, b − failbd, R′). Therefore, by Lemma 6.16,
j ∈ block(k, b, aug-ring(α′)).

Now we use the inductive hypothesis, Parts 3 and 4, again, to conclude that s′.fingers_k contains a finger for j
with exptime > s′.now + T_e − (3T_r + 6d). To apply the inductive hypothesis, we need the fact that join-ack_j
occurs at a time < ℓtime(α′) − (2T_r + 5d); this follows from our assumption that join-ack_j occurs at a time
< ℓtime(α) − (3T_r + 8d) and the fact that ℓtime(α′) > ℓtime(α) − (T_r + 3d).

Since j ∈ block(k, b, aug-ring(α′)) and s′.fingers_k contains a finger for j, Lemma 6.17 implies that j ∈
block(k, b, s′.local-ring_k). Therefore, this finger for j gets included in the block sent by k in the lookup-comp
message.

Upon receipt of this message, fingers_i contains a finger for j with exptime > now + T_e − (3T_r + 7d). Then at
the end of α, at most time T_r + 2d later, fingers_i contains a finger for j with exptime > s.now + T_e − (4T_r + 9d).
Since T_e > 4T_r + 9d, this implies exptime > s.now, as needed.


2. For Part 2, suppose that status_i = joining and a receive(lookup-comp)_{*,i} event for target i occurs in α at a
time < ℓtime(α) − (T_r + 2d). Suppose that j ∈ block(i, b, aug-ring(α)), join-ack_j occurs in α at a time
< ℓtime(α) − (T_r + 3d), and fail_j does not occur in α. We must show that fingers_i contains a finger for j with
exptime > now + T_e − (2T_r + 3d). Without loss of generality, assume that j ∈ succset(i, b, aug-ring(α)).

We first claim that there exists k ∈ PId such that k ∈ succset(i, c − 2failbd, R) − succset(i, 2failbd, R),
join-ack_k occurs at a time < ℓtime(α) − (4T_r + 10d), and fail_k does not occur in α. This follows from
Lemma 6.15, applied with e_1 = e_2 = 2failbd and e_3 = 5, using the assumption that c > 5joinbd + 4failbd.

Now we claim that j ∈ block(k, b − 2failbd, aug-ring(α)). We know that k ∈ succset(i, c − failbd, aug-ring(α)).
Also, since k ∉ succset(i, 2failbd, R), we have that k ∉ succset(i, 2failbd, aug-ring(α)). Also, by assumption,
j ∈ succset(i, b, aug-ring(α)). Lemma 6.14, Part 2, applied with e_1 = b, e_2 = c − failbd and e_3 = 2failbd,
then implies that j ∈ block(k, b − 2failbd, aug-ring(α)), as claimed.

Process i performs a join-ping at some time in the left-closed, right-open interval [ℓtime(α) − (T_r + 2d), ℓtime(α) −
2d), and i receives responses for all ping messages generated by that join-ping whose destinations do not fail.
Let α′ be the prefix of α ending just before the join-ping_i, s′ = ℓstate(α′), and R′ = global-ring(α′).

We claim that s′.fingers_i contains a finger for k. Since the time of the join-ack_k is < ℓtime(α) − (4T_r + 10d),
it is also < ℓtime(α′) − (3T_r + 8d). Since k ∈ succset(i, c − 2failbd, R), Lemma 6.16 implies that k ∈
succset(i, c − failbd, aug-ring(α)). Therefore, by Lemma 6.19, k ∈ succset(i, c − failbd, R′). Therefore, by
Lemma 6.16, k ∈ succset(i, c, aug-ring(α′)). Then the inductive hypothesis, Part 1, implies that s′.fingers_i
contains a finger for k.

Since k ∈ succset(i, c, aug-ring(α′)) and s′.fingers_i contains a finger for k, Lemma 6.17 implies that k ∈
block(i, c, s′.local-ring_i). Therefore, during the join-ping, i sends a ping message to k. Since k does not fail in
α, k responds to the ping message with a block message. Let α″ be the prefix of α ending just before k sends
the block message, let s″ = ℓstate(α″), and let R″ denote global-ring(α″).

Since j ∈ block(k, b − 2failbd, aug-ring(α)), Lemma 6.19 implies that j ∈ block(k, b − 2failbd, R″). Then
Lemma 6.16 implies that j ∈ block(k, b − failbd, aug-ring(α″)). Then by inductive hypothesis, Part 5, we
know that s″.fingers_k contains a finger for j with exptime > s″.now + T_e − (T_r + d).

Since j ∈ block(k, b, aug-ring(α″)) and s″.fingers_k contains a finger for j, Lemma 6.17 implies that j ∈
block(k, b, s″.local-ring_k). Therefore, the finger for j is included in the block sent by k in its block message to
i. At most T_r + 2d time elapses from this send until the end of α, which means that s.fingers_i contains a finger
for j with exptime > s.now + T_e − (2T_r + 3d), as needed.


3. For Part 3, suppose that s.status_i = active and a join-ack_i occurs in α at a time t > ℓtime(α) − (T_r + 2d).
Suppose also that j ∈ block(i, b, aug-ring(α)), join-ack_j occurs at a time < ℓtime(α) − (2T_r + 5d), and fail_j
does not occur in α. We must show that s.fingers_i contains a finger for j with exptime > s.now + T_e − (3T_r +
6d). Without loss of generality, assume that j ∈ succset(i, b, aug-ring(α)).

The argument is similar to that for the previous case, because we argue with respect to pings and block responses
near the end of the joining protocol. By Lemma 6.15, there exists k ∈ PId such that k ∈ succset(i, c −
2failbd, R) − succset(i, 2failbd, R), join-ack_k occurs at a time < ℓtime(α) − (5T_r + 12d), and fail_k does
not occur in α. This uses the assumption that c > 6joinbd + 4failbd. Then, since k ∈ succset(i, c −
failbd, aug-ring(α)), k ∉ succset(i, 2failbd, aug-ring(α)), and j ∈ succset(i, b, aug-ring(α)), Lemma 6.14,
Part 2, implies that j ∈ block(k, b − 2failbd, aug-ring(α)).

Process i performs a join-ping at some time in the interval [t − (T_r + 2d), t − 2d), and i receives responses
for all ping messages generated by that join-ping whose destinations do not fail in α. Let α′ be the prefix of α
ending just before the join-ping_i, s′ = ℓstate(α′), and R′ = global-ring(α′).

We claim that s′.fingers_i contains a finger for k. Since the time of the join-ack_k is < ℓtime(α) − (5T_r + 12d),
it is also < ℓtime(α′) − (3T_r + 8d). Since k is in succset(i, c − 2failbd, R), Lemma 6.16 implies that k ∈
succset(i, c − failbd, aug-ring(α)). Therefore, by Lemma 6.19, k ∈ succset(i, c − failbd, R′). (This uses
the assumption that T_e > 2T_r + 4d.) Therefore, by Lemma 6.16, k ∈ succset(i, c, aug-ring(α′)). Then the
inductive hypothesis, Parts 1 and 2, implies that s′.fingers_i contains a finger for k.

Since k ∈ succset(i, c, aug-ring(α′)) and s′.fingers_i contains a finger for k, Lemma 6.17 implies that k ∈
block(i, c, s′.local-ring_i). Therefore, during the join-ping, i sends a ping message to k. Since k does not fail in
α, k responds to the ping message with a block message. Let α″ be the prefix of α ending just before k sends
the block message, let s″ = ℓstate(α″), and let R″ denote global-ring(α″).

Since j ∈ block(k, b − 2failbd, aug-ring(α)), Lemma 6.19 implies that j ∈ block(k, b − 2failbd, R″). Then
Lemma 6.16 implies that j ∈ block(k, b − failbd, aug-ring(α″)). Then by inductive hypothesis, Part 5, we
know that s″.fingers_k contains a finger for j with exptime > s″.now + T_e − (T_r + d). (Here, we need the fact
that the time of the join-ack_k is < ℓtime(α″) − (3T_r + 6d), and the time of the join-ack_j is < ℓtime(α″) − d.)

Since j ∈ block(k, b, aug-ring(α″)) and s″.fingers_k contains a finger for j, Lemma 6.17 implies that j ∈
block(k, b, s″.local-ring_k). Therefore, the finger for j is included in the block sent by k in its block message
to i. At most 2T_r + 4d time elapses from this send until the end of α, which means that s.fingers_i contains a
finger for j with exptime > s.now + T_e − (3T_r + 5d), which suffices.


4. For Part 4, suppose that s.status_i = active and a join-ack_i occurs in α at a time < ℓtime(α) − (T_r + 2d). Suppose
that j ∈ block(i, b, aug-ring(α)), join-ack_j occurs in α at a time < ℓtime(α) − (T_r + 2d), and fail_j does not
occur in α. We must show that s.fingers_i contains a finger for j with exptime > s.now + T_e − (2T_r + 2d).
Without loss of generality, assume that j ∈ succset(i, b, aug-ring(α)).

Lemma 6.15 implies that there exists k ∈ PId such that k ∈ succset(i, c − 2failbd, R) − succset(i, 2failbd, R),
join-ack_k occurs at a time < ℓtime(α) − (4T_r + 10d), and fail_k does not occur in α. This uses the assumption
that c > 5joinbd + 4failbd.

At some time in the interval [ℓtime(α) − (T_r + d), ℓtime(α) − T_r), k performs a neighbor-refresh_k whose
messages all arrive by the end of α. Let α′ be the prefix of α ending just before this neighbor-refresh_k,
s′ = ℓstate(α′), and R′ = global-ring(α′).

Since k ∈ succset(i, c − 2failbd, R), Lemma 6.16 implies that k ∈ succset(i, c − failbd, aug-ring(α)). Also,
since i ∈ R, we know that i ∈ predset(k, c − 2failbd, R) and so, by Lemma 6.16, i ∈ predset(k, c −
failbd, aug-ring(α)). By Lemma 6.19, i ∈ predset(k, c − failbd, R′). Therefore, by Lemma 6.16, i ∈
predset(k, c, aug-ring(α′)). Then by inductive hypothesis, Part 5, s′.fingers_k contains a finger for i with
exptime > ℓtime(α′) + T_e − (T_r + d).

Next, we claim that j ∈ block(k, b − 2failbd, aug-ring(α)). We know that k ∈ succset(i, c − failbd, aug-ring(α)).
Also, since k ∉ succset(i, 2failbd, R), we know that k ∉ succset(i, 2failbd, aug-ring(α)). Then, since
j ∈ succset(i, b, aug-ring(α)), Lemma 6.14 implies that j ∈ block(k, b − 2failbd, aug-ring(α)), as claimed.

Therefore, by Lemma 6.19, j ∈ block(k, b − 2failbd, R′). So by Lemma 6.16, j ∈ block(k, b − failbd, aug-ring(α′)).
Then by inductive hypothesis, Part 5, s′.fingers_k contains a finger for j with exptime > ℓtime(α′) + T_e − (T_r +
d). Thus, s′.fingers_k contains fingers for i and j, both with exptime > ℓtime(α′) + T_e − (T_r + d).

Since i ∈ block(k, b, aug-ring(α′)) and s′.fingers_k contains a finger for i, Lemma 6.17 implies that i ∈
block(k, b, s′.local-ring_k). Therefore, i is among the targets of the block message sent by k during the
neighbor-refresh_k.

Also, since j ∈ block(k, b, aug-ring(α′)) and s′.fingers_k contains a finger for j, Lemma 6.17 implies that
j ∈ block(k, b, s′.local-ring_k). Therefore, the finger for j is included in the block sent by k in its block message
to i. When the finger is sent, it has exptime > s′.now + T_e − (T_r + d). Therefore, at the end of α, which is at
most time T_r + d later, s.fingers_i contains a finger for j with exptime > s.now + T_e − (2T_r + 2d), as needed.


5. For Part 5, suppose that s.status_i = active and a join-ack_i occurs in α at a time < ℓtime(α) − (3T_r + 6d). Also
suppose that j ∈ block(i, b − failbd, aug-ring(α)), join-ack_j occurs in α at a time < ℓtime(α) − d, and fail_j
does not occur in α. We must show that s.fingers_i contains a finger for j with exptime > s.now + T_e − (T_r + d).
Without loss of generality, assume that j ∈ succset(i, b − failbd, aug-ring(α)). Let t denote the time of the
join-ack_j. We consider two cases:

(a) t < ℓtime(α) − (2T_r + 3d).

Lemma 6.15 implies that there exists k ∈ PId such that k ∈ succset(i, c − 2failbd, R) − succset(i, 2failbd, R),
join-ack_k occurs at a time < ℓtime(α) − (5T_r + 8d), and fail_k does not occur in α. Then (as in the argument
for Part 2), Lemma 6.14, Part 2, implies that j ∈ block(k, b − 2failbd, aug-ring(α)). Therefore,
k ∈ block(j, b − 2failbd, aug-ring(α)).

Then we claim that k performs a neighbor-refresh sometime in the interval [ℓtime(α) − (2T_r + 2d), ℓtime(α) −
(T_r + 2d)). Let α′ be the prefix of α ending just before this neighbor-refresh_k, let s′ = ℓstate(α′), and
let R′ = global-ring(α′).

Since j ∈ block(k, b − 2failbd, aug-ring(α)), Lemma 6.19 implies that j ∈ block(k, b − 2failbd, R′),
and so by Lemma 6.16, j ∈ block(k, b − failbd, aug-ring(α′)). Also, since k ∈ block(i, c − 2failbd, R),
we have that i ∈ block(k, c − 2failbd, R), so by Lemma 6.16, i ∈ block(k, c − failbd, aug-ring(α)), so
by Lemma 6.19, i ∈ block(k, c − failbd, R′), so again by Lemma 6.16, i ∈ block(k, c, aug-ring(α′)), so
i ∈ block(k, b − failbd, aug-ring(α′)).

Then by inductive hypothesis, Part 5, s′.fingers_k contains a finger for each of i and j, both with exptime >
s′.now + T_e − (T_r + d). Since j ∈ block(k, b, aug-ring(α′)) and s′.fingers_k contains a finger for
j, Lemma 6.17 implies that j ∈ block(k, b, s′.local-ring_k). Therefore, j is among the targets of the
block message sent by k during the neighbor-refresh_k. Also, since i ∈ block(k, b, aug-ring(α′)) and
s′.fingers_k contains a finger for i, Lemma 6.17 implies that i ∈ block(k, b, s′.local-ring_k). Therefore, the
finger for i is included in the block sent by k in its block message to j. When the finger is sent, it has
exptime > s′.now + T_e − (T_r + d).

This block message arrives at j at a time < ℓtime(α) − (T_r + d). Then sometime in the interval [ℓtime(α) −
(T_r + d), ℓtime(α) − T_r), j performs a neighbor-refresh_j. Let α″ be the prefix of α ending just before
this neighbor-refresh_j, let s″ = ℓstate(α″), and let R″ = global-ring(α″).

Since j ∈ block(i, b − failbd, aug-ring(α)), we have, by Lemma 6.19, that i ∈ block(j, b − failbd, global-ring(α″)).
So by Lemma 6.16, i ∈ block(j, b, aug-ring(α″)). Also, s″.fingers_j contains a finger for i, because the
finger for i that arrives in the block message from k has not had time to expire. Then Lemma 6.17 implies
that i ∈ block(j, b, s″.local-ring_j). Therefore, i is among the targets of the block message sent by j during
this neighbor-refresh_j.

This block message contains a finger for j, with exptime = s″.now + T_e. Therefore, at the end of α,
at most time T_r + d later, s.fingers_i contains a finger for j with exptime > s.now + T_e − (T_r + d), as
needed.

t > ltime(a) − (2T_j + 3d).

Then the time between the join-ack_j and join-ack_i is > T_j + 3d.

Lemma 6.15 implies that there exists k ∈ PId such that k ∈ predset(j, c − 2failbd, R) − predset(j, 2failbd, R),
join-ack_k occurs at a time < ltime(a) − (6T_j + 13d), and fail_k does not occur in a. This uses the assumption that
c > 7joinbd + 4failbd.

Now we claim that i ∈ block(k, b − 3failbd, aug-ring(a)). Since k ∈ predset(j, c − 2failbd, R),
Lemma 6.16 implies that k ∈ predset(j, c − failbd, aug-ring(a)). Since k ∉ predset(j, 2failbd, R), we
have that k ∉ predset(j, 2failbd, aug-ring(a)). Since j ∈ block(i, b − failbd, aug-ring(a)), Lemma 6.14,
Part 2, implies that i ∈ block(k, b − 3failbd, aug-ring(a)), as claimed.

Process j performs a join-ping at some time in the interval [t − (T_j + 2d), t − T_j), and j receives responses
for all ping messages generated by that join-ping whose destinations do not fail, strictly before time t. Let
a' be the prefix of a ending just before this join-ping_j, s' = lstate(a'), and R' = global-ring(a').

We claim that s'.fingers_j contains a finger for k. Since k ∈ block(j, c − failbd, aug-ring(a)), Lemma 6.19
implies that k ∈ block(j, c − failbd, R'), and so by Lemma 6.16, k ∈ block(j, c, aug-ring(a')). Then by
inductive hypothesis, Parts 1 and 2, s'.fingers_j contains a finger for k.

Since k ∈ block(j, c, aug-ring(a')) and s'.fingers_j contains a finger for k, Lemma 6.17 implies that
k ∈ block(j, c, s'.local-ring_j). Therefore, during the join-ping, j sends a ping message to k. Since k does
not fail, it responds with a block message. Let a'' be the prefix of a ending just before k sends this block
message, let s'' = lstate(a''), and R'' = global-ring(a'').

Since i ∈ block(k, b − 2failbd, aug-ring(a)), Lemma 6.19 implies that i ∈ block(k, b − 2failbd, R'').
Then Lemma 6.16 implies that i ∈ block(k, b − failbd, aug-ring(a'')). Then by inductive hypothesis, Part
5, we know that s''.fingers_k contains a finger for i with exptime > now + T_e − (T_j + d).

Since i ∈ block(k, b, aug-ring(a'')) and s''.fingers_k contains a finger for i, Lemma 6.17 implies that
i ∈ block(k, b, s''.local-ring_k). Therefore, the finger for i is included in the block sent by k in its block
message to j. This finger is recorded by j, and persists until the end of a.

Immediately after the join-ack_j, and at intervals of T_j thereafter, process j performs a neighbor-refresh_j,
in which it sends a block message containing a finger for itself with exptime = now + T_e.

We claim that i is included in the set of targets of each such block message. This is because i ∈ block(j, b −
failbd, aug-ring(a)), so by Lemma 6.16, i ∈ block(j, b, aug-ring) at each point after the join-ack_j. Then
Lemma 6.17 implies that i ∈ block(j, b, local-ring_j) at each point after the join-ack_j, which implies that i
is included in the set of targets of each such block message.

Some such message must arrive at i that is sent by j at a time > ltime(a) − (T_j + d). Therefore, s.fingers_i
contains a finger for j with exptime > s.now + T_e − (T_j + d), as needed.


6.5 Maintaining the Chords 


We state a lemma analogous to the main lemma of the previous section, Lemma 6.20, but for neighbors of each
particular chord position x rather than neighbors of the node i itself.

The statements of Parts 1, 2, and 3 are entirely analogous to those in Lemma 6.20. However, in Part 4, the fact
that i uses chord-pings instead of neighbor-refreshes to keep up-to-date with respect to x after the join-ack_i changes
the bound slightly. Part 5, which describes situations where i obtains first-hand knowledge of j directly from j, gets
weakened considerably. This is because we have no phenomenon analogous to that of the prior case 5(b), where j
informs i directly about its existence immediately after the join. So, the new Part 5 talks only about those j that are so
close to the chord position that i pings j directly during its chord-pings. Since i pings only the apparent c-block of x,
this involves only those j that are in this tiny neighborhood.

The proof is also different in some interesting ways. Rather than relying on the inductive hypotheses as before, we
rely on the earlier lemma about neighborhoods, Lemma 6.20. That is because the relevant information arrives from
neighbors of the chord position x.
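
As a concrete illustration of the chord positions discussed here, the following minimal sketch enumerates the positions PToX(i) + 2^k for one node. It assumes, as in standard Chord, a circular identifier space of size 2^n with wraparound; the function name `chord_positions` and the use of plain integers for XIds are hypothetical choices made only for this example.

```python
def chord_positions(xid: int, n: int) -> list[int]:
    """Return the chord positions xid + 2**k (mod 2**n) for k = 0, ..., n-1.

    Assumption: identifiers live on a ring of size 2**n (standard Chord).
    """
    size = 1 << n  # number of identifiers on the assumed ring
    return [(xid + (1 << k)) % size for k in range(n)]


if __name__ == "__main__":
    # Node with identifier 5 on a 16-element ring (n = 4).
    print(chord_positions(5, 4))   # [6, 7, 9, 13]
    # Positions wrap around the ring for identifiers near the top.
    print(chord_positions(15, 4))  # [0, 1, 3, 7]
```

Each returned value plays the role of one chord position x in the lemma that follows; the lemma concerns the fingers that node i holds for processes near each such x.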


Lemma 6.21. Let a be a good finite execution, s = lstate(a). Let i be a process that does not fail in a. Let
k ∈ N, 0 ≤ k ≤ n − 1, and x = PToX(i) + 2^k. Then:


1. Suppose that s.status_i = joining and a receive(lookup-comp)_i event for target x occurs in a at a time
> ltime(a) − (T_j + 2d). Suppose that j ∈ block(x, c, aug-ring(a)), join-ack_j occurs in a at a time <
ltime(a) − (3T_j + 8d), and fail_j does not occur in a.

Then s.fingers_i contains a finger for j with exptime > s.now.


2. Suppose that s.status_i = joining and a receive(lookup-comp)_i event for target x occurs in a at a time <
ltime(a) − (T_j + 2d). Suppose that j ∈ block(x, b, aug-ring(a)), join-ack_j occurs in a at a time < ltime(a) −
(T_j + 3d), and fail_j does not occur in a.

Then s.fingers_i contains a finger for j with exptime > s.now + T_e − (2T_j + 3d).


3. Suppose that s.status_i = active and a join-ack_i occurs in a at a time > ltime(a) − (T_j + 2d). Suppose that
j ∈ block(x, b, aug-ring(a)), join-ack_j occurs in a at a time < ltime(a) − (2T_j + 5d), and fail_j does not
occur in a.

Then s.fingers_i contains a finger for j with exptime > s.now + T_e − (3T_j + 6d).


4. Suppose that s.status_i = active and a join-ack_i occurs in a at a time < ltime(a) − (T_j + 2d). Suppose that
j ∈ block(x, b, aug-ring(a)), join-ack_j occurs in a at a time < ltime(a) − (T_j + 3d), and fail_j does not occur
in a.

Then s.fingers_i contains a finger for j with exptime > s.now + T_e − (2T_j + 3d).


5. Suppose that s.status_i = active and a join-ack_i occurs in a at a time < ltime(a) − (2T_j + 4d). Suppose that
j ∈ block(x, c − failbd, aug-ring(a)), join-ack_j occurs in a at a time < ltime(a) − (2T_j + 5d), and fail_j does
not occur in a.

Then s.fingers_i contains a finger for j with exptime > s.now + T_e − (T_j + 2d).


Proof. Parts 1, 2, and 3 are proved similarly to before, but instead of inductive hypotheses, they use the relevant
parts of Lemma 6.20.

For Part 4, we rely on the chord-ping mechanism and, again, on the relevant parts of Lemma 6.20 rather than
inductive hypotheses.

For Part 5, we use Part 4 to conclude that i learns about j by time ltime(a) − (T_j + 2d), and then rely on the
chord-ping mechanism. The key is that in this last chord-ping, i communicates directly with (pings) j.


6.6 Correctness of Lookup Results 


Theorem 6.22. Every good execution a satisfies (2T_j + 6d)-lookup-correctness.


Proof. (Sketch:) Let a' be a prefix of a ending just before a lookup-ack(H)_i event, which is a response to a prior
lookup(x)_i. Let s' = lstate(a') and R' = global-ring(a').

It suffices to produce a ring R such that R ⊆ aug-ring(a'), R contains every XId in R' except possibly for those
j such that join-ack_j occurs in a' at a time > ltime(a') − (2T_j + 6d), and H = ppredset(x, c, R).

Define the ring R to be the union S ∪ T, where:


• S is the set of all PToX(j) ∈ R' such that join-ack_j occurs at a time < ltime(a') − (2T_j + 6d).


• T is the set of all XIds in s'.fingers_i.

We show that R satisfies the three properties.

The first property is immediate, because all XIds in s'.fingers_i are in aug-ring(a'). The second property is also
immediate, because S ⊆ R. For the third property, the code for lookup-ack(H)_i implies that H = ppredset(x, c, s'.local-ring_i).
We need to show that H = ppredset(x, c, R). Since local-ring_i is the set of XIds in fingers_i, and that set is a subset
of R, it is enough to show that every j ∈ ppredset(x, c, R) is also in s'.fingers_i.

So, fix j ∈ ppredset(x, c, R). If j ∈ T then we are done, so assume that j ∈ S. Thus, j ∈ R' and join-ack_j occurs
at a time < ltime(a') − (2T_j + 6d). Since j ∈ ppredset(x, c, R), we have that j ∈ ppredset(x, c + 3joinbd, R').

The lookup-ack(H)_i event follows the receipt by i of a lookup-comp message, with no intervening time passage.
Let k be the sender of this lookup-comp message. Then k sent this message at some time > ltime(a') − d. Let a'' be
the prefix of a ending just before k composed this message, s'' = lstate(a''), and R'' = global-ring(a'').

We claim that k ∈ ppredset(x, c + 3joinbd, R''); the argument is like one in Lemma 6.20, Part 1.

Since j ∈ ppredset(x, c + 3joinbd, R'), it follows that j ∈ ppredset(x, c + 3joinbd + failbd, R''). Since j ∈
ppredset(x, c + 3joinbd + failbd, R'') and k ∈ ppredset(x, c + 3joinbd, R''), it follows that j ∈ block(k, c +
3joinbd + failbd, R''). Since b > c + 3joinbd + 2failbd, we have that j ∈ block(k, b − failbd, R''). Therefore,
j ∈ block(k, b, aug-ring(a'')).

By Lemma 6.20, Parts 3 and 4, s''.fingers_k contains a finger for j with exptime > s''.now + T_e − (3T_j + 6d).
This finger for j gets included in the block sent by k in the lookup-comp message. After i receives this message,
fingers_i contains a finger for j with exptime > now + T_e − (3T_j + 7d). Then, since T_e > 3T_j + 7d, s'.fingers_i
contains a finger for j. This is what we needed to show.
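
The proof sketch above uses two numeric conditions on the parameters: b > c + 3joinbd + 2failbd, and the finger expiration time exceeding three refresh periods plus 7d. The sketch below simply checks those two inequalities for a given parameter set. The class and field names are hypothetical, and reading the two time constants as `T_e` (expiration) and `T_j` (refresh period) is an assumption about the notation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Params:
    """Hypothetical container for the parameters used in the proof sketch."""
    joinbd: int   # bound on joins in the relevant arc and interval
    failbd: int   # bound on failures in the relevant arc and interval
    c: int        # small block size
    b: int        # large block size
    T_e: float    # finger expiration time (assumed name)
    T_j: float    # refresh period (assumed name)
    d: float      # message delay bound


def proof_conditions(p: Params) -> dict[str, bool]:
    """Evaluate the two inequalities the proof sketch relies on."""
    return {
        "b > c + 3*joinbd + 2*failbd": p.b > p.c + 3 * p.joinbd + 2 * p.failbd,
        "T_e > 3*T_j + 7*d": p.T_e > 3 * p.T_j + 7 * p.d,
    }


if __name__ == "__main__":
    example = Params(joinbd=1, failbd=1, c=11, b=17, T_e=10.0, T_j=2.0, d=0.5)
    for name, ok in proof_conditions(example).items():
        print(f"{name}: {ok}")
```

This checks only the conditions named in this proof; the other constraints of the paper are not covered by the sketch.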


6.7 Latency Bounds 
6.7.1 Latency of a request 


Theorem 6.23. Suppose that a is a good execution, a’ a finite prefix of a containing at least 2c + 1 join-ack events. 
Suppose that: 


1. The final step of a' is a lookup_i step in which i initiates request r, with target x.
2. No other requests (on behalf of joins, client lookups, or stabilizes) are active at any time > ltime(a') − T_e.

Then request r terminates with a receive(lookup-comp) step, at a time that is < ltime(a') + 4(log N + 1)d.


Proof. (Sketch:) We first claim that, at any point during the lookup, for any process j ≠ i in the ring, the known
predecessors of the target x are “bunched together” in at most two c-blocks in the actual global ring. One of these is
the block of actual predecessors of x in the ring, and the other may be anywhere else.


Claim 6.24. At any point in a after a', and for any j ≠ i, all processes in ppredset(x, c, local-ring_j) that have
not failed lie within two c-blocks of consecutive processes in global-ring: ppredset(x, c, global-ring) and one other
c-block.


Proof. Everyone except i keeps only its neighborhood and chord fingers, as specified by the underlying infrastructure.
These have the needed property. (Two blocks can arise if the target x is in the middle of one of j's blocks.)


Claim 6.25. At any point in a after a', and before a receive(lookup-comp)_i event, all processes in ppredset(x, c −
4failbd, local-ring_i) that have not failed lie within two c-blocks of consecutive processes in global-ring: ppredset(x, c, global-ring)
and one other c-block.


Proof. (Sketch:) This is more complicated than the previous claim, because process i acquires fingers from other
nodes' tables in the course of the lookup.

The ways in which process i acquires new fingers are somewhat constrained: by normal neighborhood and chord
refreshing, by receiving a lookup-resp message, or by receiving a lookup-comp message. We rule out the last case by
assumption: we are considering only what happens before the first receive(lookup-comp)_i happens.

Thus, whenever i acquires new fingers, it acquires an entire block of size at least c from some other node, which
by the previous claim is included in only two c-blocks in the actual global ring at the time the block was sent, one of
these blocks being ppredset(x, c, global-ring).

Since at most failbd of each of these blocks could have failed before the block was sent, and at most another failbd
from each of these blocks could fail after the send and up to the point of reference, it must be that at least c − 4failbd
of the newly-arrived fingers do not fail by the point of reference and lie within two c-blocks in global-ring, with one
of these blocks being ppredset(x, c, global-ring).

But this doesn't quite tell us that all processes in ppredset(x, c − 4failbd, local-ring_i) that have not failed lie within
these two c-blocks of consecutive processes in global-ring. For this, we have to use the fact that the blocks in fingers_i
that are closest to x don't “degrade” by having too many processes fail. The reason this doesn't happen is that i keeps
moving the algorithm along, pinging “enough” nodes among its closest predecessors for x and receiving responses
from many of them, which provide information about blocks that are still closer to x.


Now the key claim describes how the “distance” to the destination x is halved every time 4d, until near the end of 
the lookup: 


Claim 6.26. Let e be a power of two, e < N.

Suppose that, at some point during the lookup, the clockwise distance from pred(x, c − 4failbd, local-ring_i) to x (in
the identifier space) is < e.

Then by time 4d later, at least one of the following holds:


1. The lookup ends (with the receipt of a lookup-comp message).
2. fingers_i contains at least c − 2failbd of the members of ppredset(x, c, global-ring).
3. The clockwise distance from pred(x, c − 4failbd, local-ring_i) to x is < e/2.


Proof. (of Claim:) Assume that the lookup doesn't end within time 4d, that is, Case 1 doesn't hold. Then within
time 2d, process i performs a new join-ping, which results, within an additional time 2d, in a response from one of the
processes corresponding to the XIds in the assumed ppredset(x, c − 4failbd, local-ring_i). (The fact that one responds
depends on the fact that not all of these processes can have failed recently or fail during the ping-response exchange.
This in turn relies on our assumed bound on failure rate, and the assumption that they are all within two c-blocks in
the global ring.)

Let j be such a responding process. If PToX(j) + 1 = x, that is, x is the immediate successor of j in the XId
space, then j sends a lookup-comp message, contradicting the fact that Case 1 doesn't hold. So, we may assume that
x is not the immediate successor of j in the XId space.

Then choose k to be the largest natural number such that PToX(j) + 2^k ∈ (PToX(j), x), that is, the largest
power-of-two successor of j that does not reach x.

The response from j to i contains a set F of fingers representing j's c best predecessors for x at the time j sends
its response. There are two cases:


1. F contains only elements in the open interval (PToX(j) + 2^k, x). That is, only elements after the given largest
power-of-two successor of j.


In this case, after i receives the message, the clockwise distance from pred(x, c − 2failbd, local-ring_i) to x is
< e/2, which suffices to satisfy Case 3.


2. F contains at least one element that is not in the open interval (PToX(j) + 2^k, x).


Lemma 6.21 implies that, when j sends the lookup-response message, fingers_j contains entries for all elements
of block(j + 2^k, b, augmented-ring) that have not failed. Since the set F contains at least one element that
is not in the open interval (j + 2^k, x), we claim that F contains actual predecessors of x in the global ring,
specifically, F contains at least c − failbd of the members of ppredset(x, c, global-ring) at the time j sends the
message. (Up to failbd of the fingers in F could have already failed at the time of the send.) Just after i receives
the message, fingers_i contains at least c − 2failbd of the members of ppredset(x, c, global-ring). This yields
Case 2.


To complete the proof, we use the last claim repeatedly, as long as Case 3 holds. Since we cannot keep halving
forever, eventually, either Case 1 or Case 2 arises. If Case 1 arises first, then we are done. On the other hand, if Case
2 arises first, then within only one more ping round, i receives a lookup-comp message, so again we are done.
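
The counting behind the bound in Theorem 6.23 can be illustrated with a short sketch: each application of Claim 6.26 takes time at most 4d and halves the distance bound e, so starting from e = N there are at most log N halvings, plus one further ping round. The code assumes that N is a power of two and that the logarithm is base 2; both function names are hypothetical.

```python
import math


def halving_rounds(N: int) -> int:
    """Count how many times a distance bound e = N can be halved before reaching 1.

    Assumption: N is a power of two, so this equals log2(N).
    """
    e, rounds = N, 0
    while e > 1:
        e //= 2
        rounds += 1
    return rounds


def lookup_latency_bound(N: int, d: float) -> float:
    """Evaluate the bound 4(log N + 1)d from Theorem 6.23, taking log base 2."""
    return 4 * (math.log2(N) + 1) * d


if __name__ == "__main__":
    N, d = 1024, 1.0
    print(halving_rounds(N))           # 10
    print(lookup_latency_bound(N, d))  # 44.0
```

With N = 1024 and d = 1 the ten halving rounds and the final round each cost at most 4d, which gives the 44 time units printed above.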
7 Appendix B: Using Nondeterministic Assumptions


As described in Section 3 and Appendix B, our analysis in this paper is based on deterministic assumptions. In general,
we assume that there are at most ν relevant events that occur in an “arc” of the ring containing at most r processes
during a time interval Δ.

These assumptions, however, are not realistic for many distributed environments. In practice, join and failure events
are modeled by probability distributions (e.g., Poisson), which makes it impossible to put a deterministic bound
on the number of such events during an interval of time Δ.

To establish a relationship between the more realistic probabilistic assumptions and the deterministic assumptions,
we next compute the mean time T_f between two violations of the deterministic bounds under the probabilistic
assumptions. In other words, T_f represents the expected time for which a MultiChord system will remain in the
quasi-ideal state.

For tractability, we assume a system in which processes join according to a Poisson process with arrival rate λ_q
and that the process lifetimes are exponentially distributed with a mean of l. Assuming that the MultiChord ring is in
steady state we have l = N/λ_q, i.e., the rate of joins is equal to the rate of failures or leaves. Thus, the rate of changes
is λ = 2λ_q.

Next, we bound the probability that the deterministic assumption (that no more than ν relevant events occur during
a time interval Δ in an arc of the ring of r processes) is violated.

The average number of events that occur in a given arc of the ring consisting of r processes during an interval of
time Δ is

\[ \mu = \Delta \cdot \lambda \, (r/N), \tag{2} \]

where Δ · λ represents the average number of events that occur in the entire system during a time interval Δ, and
(r/N) represents the fraction of these events that occur in that portion of the ring.
Because events are generated from a Poisson distribution we can apply the Chernoff bound:

\[ \Pr(X > (1+\delta)\mu) < e^{-\mu\delta^2/4}, \tag{3} \]

where Pr(X > (1 + δ)μ) represents the probability that more than (1 + δ)μ events occur in a given arc of r
processes during a time interval Δ. Taking ν = (1 + δ)μ, the probability that the deterministic bound is violated in a
given arc of r processors during a time interval Δ is


\[ \Pr(X > \nu) < e^{-\frac{(\nu-\mu)^2}{4\mu}}. \tag{4} \]


The probability p(Δ, r) that the deterministic bound is violated in any arc of r processors during an interval Δ is
bounded above by

\[ p(\Delta, r) \le N \Pr(X > \nu) < N e^{-\frac{(\nu-\mu)^2}{4\mu}}. \tag{5} \]
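
To make the two bounds above concrete, the following sketch compares the exact Poisson tail Pr(X > ν) with an exponential bound of the form exp(−(ν − μ)²/(kμ)) and then applies the union bound over N arcs. The constant k = 4 is one common form of the Chernoff bound and is an assumption here, as are all function names; the example values are chosen with ν ≤ 2μ, where this form of the bound is known to hold.

```python
import math


def poisson_tail(mu: float, v: int) -> float:
    """Exact Pr(X > v) for X ~ Poisson(mu), computed from the CDF."""
    cdf = sum(math.exp(-mu) * mu**k / math.factorial(k) for k in range(v + 1))
    return 1.0 - cdf


def chernoff_bound(mu: float, v: float, const: float = 4.0) -> float:
    """Bound exp(-(v - mu)^2 / (const * mu)); const = 4 is an assumption."""
    return math.exp(-((v - mu) ** 2) / (const * mu))


def any_arc_bound(N: int, mu: float, v: float) -> float:
    """Union bound over N arcs: N times the single-arc bound."""
    return N * chernoff_bound(mu, v)


if __name__ == "__main__":
    mu, v, N = 2.0, 4, 10
    print("exact tail     :", poisson_tail(mu, v))
    print("single-arc bound:", chernoff_bound(mu, v))
    print("any-arc bound   :", any_arc_bound(N, mu, v))
```

The exact tail is much smaller than the exponential bound for these values, which is expected: the bound trades tightness for a closed form that can be inverted to estimate the time between violations.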


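Then the mean time T_f between two violations of the deterministic bound is

\[ T_f = \frac{\Delta}{p(\Delta, r)} > \frac{\Delta}{N} \, e^{\frac{(\nu-\mu)^2}{4\mu}}. \tag{6} \]

Expanding μ yields

\[ T_f > \frac{\Delta}{N} \, e^{\frac{(\nu - \bar{\lambda}\Delta r)^2}{4\bar{\lambda}\Delta r}}, \tag{7} \]

where λ̄ = λ/N represents the normalized rate of change.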
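
A lower bound of this shape is easy to evaluate numerically. The sketch below computes (Δ/N)·exp((ν − μ)²/(kμ)) with μ = Δ·λ·r/N. The function and parameter names are hypothetical, the constant k = 4 in the exponent is an assumption, and the example values are illustrative only, not taken from the paper's evaluation.

```python
import math


def mean_time_between_violations(
    N: int, lam: float, delta: float, r: int, v: float, const: float = 4.0
) -> float:
    """Lower bound on the mean time between violations of the deterministic bound.

    N     -- number of processes in the ring
    lam   -- system-wide rate of change (joins plus failures per unit time)
    delta -- length of the time interval
    r     -- number of processes in the arc
    v     -- maximum number of events tolerated in the arc during delta
    const -- constant in the exponent (4 is an assumption)
    """
    mu = delta * lam * r / N  # expected events in the arc during delta
    if v <= mu:
        raise ValueError("the bound is only meaningful when v exceeds the mean mu")
    return (delta / N) * math.exp((v - mu) ** 2 / (const * mu))


if __name__ == "__main__":
    # Illustrative values: 1000 processes, 2 changes per second system-wide,
    # a 10-second interval, an arc of 14 processes, at most 4 events tolerated.
    print(mean_time_between_violations(N=1000, lam=2.0, delta=10.0, r=14, v=4))
```

Because the exponent grows quadratically in ν − μ, small increases in the tolerated number of events lengthen the estimated violation-free period sharply.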
Next, let us consider how the deterministic constraints presented in Section 3 map to Ineq. (7). In particular, we
consider the following constraints:

\[
\begin{aligned}
T_e &> 5\,T_j && (8) \\
c &> 7\,\mathit{joinbd} + 4\,\mathit{failbd} \\
b &> 2c + 3\,\mathit{joinbd} + \max(2\,\mathit{joinbd}, \mathit{failbd}), && (9)
\end{aligned}
\]


where failbd represents the number of failures in an arc of b + 1 processes during time T_e, and joinbd represents the
number of joins in an arc of b + 1 processes during time T_j.

Because we assume steady state, the number of failures and joins in an arc of b + 1 processes is roughly the same
during a given time interval. This means that failbd = joinbd · T_e/T_j. If we take T_e/T_j = 5, the last two constraints
in Ineqs. (8) and (9) become:


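\[
\begin{aligned}
c &> 27\,\mathit{joinbd} && (10) \\
b &> 61\,\mathit{joinbd}
\end{aligned}
\]

during an interval of time T_e, and

\[
\begin{aligned}
c &> 5.4\,\mathit{joinbd} && (11) \\
b &> 12.2\,\mathit{joinbd}
\end{aligned}
\]

during an interval of time T_j.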

Since constraints (11) imply constraints (10), we next consider only constraints (11). Let us take c = 6joinbd,
b = 13joinbd, values which satisfy both these constraints.
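
The arithmetic behind these coefficients can be checked directly. The sketch below assumes, as the numbers suggest, that the coefficients in (11) are those in (10) divided by the ratio T_e/T_j = 5, and verifies that the choice c = 6joinbd, b = 13joinbd satisfies (11). All variable names are hypothetical, and quantities are expressed in units of joinbd.

```python
from fractions import Fraction

RATIO = 5  # assumed value of T_e / T_j, as taken in the text

# With failbd = RATIO * joinbd, the constraint c > 7 joinbd + 4 failbd gives:
c_coeff_Te = 7 + 4 * RATIO          # 27, the coefficient in (10)
b_coeff_Te = 61                     # coefficient in (10), taken from the text

# Rescaling from an interval of length T_e to one of length T_j:
c_coeff_Tj = Fraction(c_coeff_Te, RATIO)  # 27/5 = 5.4
b_coeff_Tj = Fraction(b_coeff_Te, RATIO)  # 61/5 = 12.2


def satisfies_11(c: int, b: int) -> bool:
    """Check constraints (11) for c and b given in units of joinbd."""
    return c > c_coeff_Tj and b > b_coeff_Tj


if __name__ == "__main__":
    print(float(c_coeff_Tj), float(b_coeff_Tj))
    print(satisfies_11(6, 13))
```

Exact fractions are used so that the comparison against 5.4 and 12.2 is not affected by floating-point rounding.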

Finally, we take r = b + 1, Δ = T_j, and ν = 2joinbd (the factor of 2 is because ν accounts for both joins and
failures during the interval T_j). With these values, the expected time before the deterministic constraints are violated
(see Ineq. (7)) becomes

\[ T_f > \frac{T_j}{N} \, e^{\frac{(c/3 - \bar{\lambda} T_j (b+1))^2}{4 \bar{\lambda} T_j (b+1)}}, \tag{12} \]

where b > 13c/6.